PREFACE


This Memorandum is intended primarily to help fill a need in the
array of statistical tools now in common use throughout the Air Force
cost analysis community. For the past five years, for example, the
growth in the use of regression analysis has been very rapid. More
importantly, the sophistication and understanding with which the
statistical mechanics are being applied are also growing. There has,
however,  been  very  little  use  of  probability  sampling  in  Air  Force 
cost  analysis,  although  many  have  recognized  the  possible  utility  of 
sampling  applications. 

This  Memorandum  was  prepared  at  the  request  of  cost  analysts  at 
both  Headquarters  United  States  Air  Force  and  major  air  command  levels. 
For  the  most  part,  it  represents  the  distillation  of  material  from 
available  sources  (see  Bibliography).  The  intent  is  to  provide  an 
introduction  to  sampling  methods,  using  the  most  applicable  features 
of  several  recognized  sampling  techniques.  It  is  hoped  that  this 
document  will  provide  some  basic  understanding  and  encouragement, 
leading  to  more  widespread  application  in  military  cost  analysis. 
SUMMARY


In this Memorandum, various aspects of probability sampling are
discussed with a view toward supplementing the tool-kit of the military
cost analyst. Beginning with a discussion of the relative merits
of the sample as a means of data collection, the paper moves to a
simplified treatment of sampling theory, and of the more basic techniques
of sample design and estimation. Attention is given throughout
to the use of cost-effectiveness criteria in choosing among alternative
sampling plans. Sufficient coverage is provided to guide simple
survey investigations, and a bibliography is provided for further
reference. The exposition assumes at least a limited familiarity with
statistical theory (as, for example, might be provided in the Air
Force Institute of Technology training programs); accordingly, many
concepts and definitions are given only cursory treatment.

Cost analysts rely heavily on data that are often imperfectly
defined. Existing data sources are often fraught with errors of
observation, reporting errors, and errors of classification. Sampling
offers an approach to the data quality problem that is usually
cheaper, faster, and more flexible than attempts to modify existing
massive data collection systems. A sample is, of course, also subject
to error in that it represents only some fraction of the total; the
difference is that, with proper procedures, the magnitude of this kind
of error (i.e., sampling error) can be objectively estimated from the
sample itself.

The basic motive underlying the design of a sampling scheme is to
minimize sampling error for a given cost or, alternatively, to minimize
cost for a given allowable sampling error. In either case, the solution
to the design problem depends on the particular behavior under
study and the amount of prior information available. Some basic "tools"
that the analyst has at his disposal are stratification, clustering,
subsampling, systematic sampling, ratio and regression estimators, and
sampling with unequal probabilities. Although some design problems
may find optimum solutions in rather complicated combinations of these
tools, such so-called "complex" samples usually sacrifice the virtue
of objectively estimable sampling error (at least with the current
state of the art of sampling theory). To some extent, this inadequacy
also exists in using sampled data for regression analyses, although it
appears that in this case the problem can be merely circumvented.
I. INTRODUCTION


In 1589, Galileo Galilei tossed a couple of weights from the top
of the tower at Pisa and made the remarkable observation that they
both landed at the same time. He did not find it necessary to drag
every movable object in Pisa to the top of the tower for similar
disposition; inductive logic led him to conclude that all objects,
regardless of mass, are equally accelerated by the earth's gravity.

Ten  years  ago,  a  sampling  expert  carefully  selected  a  hundred 
oranges  on  a  hundred  different  trees  and  successfully  estimated  the 
juice content of the entire Florida orange crop within 2½ percent.
The usual method of estimation, a canvass of growers' expectations,
was typically off 7½ percent.

In  1936,  Literary  Digest's  pre-election  poll  predicted  an  easy 
victory  for  Alfred  Landon  over  Franklin  Roosevelt.  Roosevelt  won  by 
a  landslide,  carrying  46  of  48  states,  and  Literary  Digest  soon  faded 
from  existence. 

In  reviewing  the  effects  of  airmen  personnel  policies,  the  Air 
Force  relies  on  a  survey  of  airmen  attitudes,  using  a  questionnaire 
sample  of  less  than  one  percent  of  the  total  airmen. 

****** 

A sample survey is a vehicle for inductive reasoning; it provides
for the transformation of observations of a part into conclusions
regarding the whole, whether that whole be lead weights, oranges, voters,
or airmen attitudes; it can be a very powerful device for information
or misinformation, depending on the sampler's adherence to good
procedure. The intent of this document is to discuss aspects of good
sampling procedure and how they might be applied in cost analysis.

The pages that follow provide a broad overview of sampling method,
particularly as it might apply to cost analysis. A large number of
topics will be touched upon, albeit briefly. The rather shallow depth
will be complemented by references to the sampling literature that is
listed in the bibliography. Simplified examples will be provided both
to illustrate points and to suggest applications of the various
techniques discussed. The result is an abridged "primer" on sampling for an
audience of analysts who might become involved in the actual design
and implementation of a sample. The aim is not to provide a
short course in sampling theory, but to draw together in summary
different aspects pertinent to sampling cost data. The reader will be
provided armament to tackle only the simplest sampling surveys, but perhaps
he will be encouraged to peruse sampling literature of greater depth or
to communicate his needs to a more experienced sampling consultant.

Since the military cost analyst is typically concerned with
support of planning or programming activities, his interest in data
collected is usually for input into some forecasting relationship.
Nevertheless, there is often significant interest in simply assessing
the state of the world through the estimation of averages, totals, and
ratios. Except for a small section dealing with the use of sample
data in regression analysis, the emphasis of this paper is on
obtaining data and estimates that reflect current fact. This orientation
should not be thought of as ignoring the forecasting problem facing the
analyst, but as an attempt to limit the scope to the problems of data
collection, which are the same for forecasting as for estimating
current totals and averages.

MOTIVATIONS  FOR  SAMPLING 

Cost analysis is highly dependent upon large amounts of data which,
ideally, are reliable, accurate, and precisely defined. The required
data are both historical and current, financial and non-financial, and
are often imperfectly provided by existing reporting systems. It is,
of course, seldom economically and administratively expedient to
suggest that the cost analyst go outside the existing reporting system
for large-scale data collection. The basic premise of this document
is that there are occasions where small-scale methods could alleviate
the data quality problem.

Consider  one  important  element  of  data  quality,  its  reflection  of 
the  precise  characteristics  to  be  analyzed.  In  a  general  sense,  the 
cost  analyst  is  "end-product"  oriented;  the  typical  frame  of  reference 
is  the  weapon  or  support  system  and  the  activities  and  costs  relating 
to it. Many reporting systems, on the other hand, reflect data in
organizational, functional, or commodity terms. These data, while
useful for management purposes, may be of no direct use for cost analysis
since they are not also coded to end item. In the absence of more
precise information, artificial analytical means (such as prorating)
must be used to infer the relationships between the available data and
the weapon system, program element, or other focus of interest.
Sampling may provide a direct means of obtaining the relationship. A sample
survey can often be used for direct observation of work in process at
a limited number of sites where direct identification to end product
is possible. Similarly, a sample survey may be designed which calls
for personnel within an organization to keep supplemental records for
a short period of time. Other samples may make use of data available
at the transaction level but which are summarized out of existence in
the preparation of upward moving reports. Whatever the exact content
of the sample survey, the intent would always be to collect a relatively
small  amount  of  data  (by  weapon  system  or  program  element,  etc.)  from 
which  reliable  and  consistent  inferences  about  total  behavior  can  be 
made. 


Sample  or  Census 


One  of  the  obvious  solutions  to  the  cost  analyst's  difficulty  in 
obtaining  the  required  end  product  data  is  the  preparation  of  a  new 
report which would provide a continuing census of the data. The
following five considerations are basic to the choice between sampling
and complete enumeration.


(1) Flexibility. A sample survey is not permanent and may be
easily modified to reflect interest in different characteristics
should conditions change. By contrast, a formal
reporting system is often difficult to modify and often
continues to exist after the need for the data has been
obviated.

The Resources Management Systems (RMS) concept would in part
reduce the frequently large informational disparity between weapon/
support system or program element and functional, commodity, and
organizational management. It will be some time, however, before RMS
will have an appreciable effect on the data available to the cost
analyst, particularly in the operating area.

(2)  Cost  and  Available  Resources.  Depending  on  the  nature  of 
the  information  source,  it  is  usually  cheaper  to  secure 
data from a fraction of the aggregate, allowing a relatively
larger allocation of resources to the interpretation
of results.

(3)  Speed.  Similarly,  data  often  can  be  collected  and  summarized 
more  quickly  with  a  sample  than  with  a  complete  count. 

(4)  Scope.  Sampling  may  be  preferable  when  the  purpose  is  to 
study broad, aggregate characteristics. However, if accurate
information is wanted for many subcategories, a complete
census  may  be  more  appropriate. 

(5)  Accuracy.  Strangely  enough,  a  sample  may  actually  produce 
more  accurate  results  than  a  census.  Inaccuracy  in  a  census 
may  stem  from  carelessness  in  handling  the  voluminous  data, 
poorly trained assistants, or the necessity to use data
collected by other people for other purposes. Although a sample
deals  with  only  a  portion  of  the  total,  the  data  may  be  much 
more  credible. 

Flexibility  and  speed  are  important  advantages  when  considering 
the  application  of  sampling  for  cost  analysis  data.  Often,  the  data 
required in support of a planning or programming study are transitory.
If  cost  analysis  is  to  play  a  role  in  the  study,  usually  there  will  be 
a  premium  on  the  timeliness  of  the  data.  Hence,  a  new  data  collection 
and  reporting  system  is  likely  to  be  of  little  use. 

One  useful  by-product  of  sampling  is  that  it  helps  formalize  the 
analysis procedure. It stimulates a rational, organized process of
inquiry  by  forcing  the  analyst  to  ask  questions  about  objectives, 
scope,  relevant  data,  and  desired  precision. 

Sampling  Computerized  Data 
or cards, it would probably be easier to program a routine to
summarize all the information than to draw off a representative sample.
Even so, there may be circumstances that suggest sampling either
within or outside of the existing system.*

There  are  two  main  reasons  why  sampling  from  existing  computerized 
data might be desired:

(1) When individual data points are to be examined further for
qualitative or non-reported characteristics, there may not
be  enough  time  to  deal  with  an  entire  census. 

(2)  The  data  base  may  be  too  bulky  or  complex  to  handle  in  the 
aggregate  even  with  the  data-reduction  capabilities  of  the 
computer.  It  may  therefore  be  necessary  to  sample  in  order 
to  determine  the  most  useful  breakout  for  the  computer  to 
follow  in  summarizing  the  data. 

There  are  three  reasons  why  sampling  outside  existing  systems 
might  be  suggested: 

(1)  It  may  be  desirable  to  generate  a  new  data  base  when  the 
existing  system  is  fraught  with  inaccuracies. 

(2)  Sampled  data  may  be  useful  in  testing  the  credibility  of 
the system, and in some cases may be used for data
adjustment.

(3)  There  may  be  no  existing  system  that  provides  the  type  of 
data  needed. 

Depending  on  the  sample  size,  sampling  outside  existing  systems 
(i.e., actual observation of the behavior under study) can require
considerable time and expense. Expense is minimized by the proper
choice of sample design, which in turn depends on many factors:
allowable error of estimate, allowable budget, variability of the behavior
under  study,  geographic  scope  of  the  study,  etc.  These  topics  will  be 
considered  later. 


*IBM has developed some interesting ways to sample computerized
information. See Fan, Muller, and Rezucha.
Sources of Error in Existing Data

The  inaccuracies  often  found  in  existing  reporting  systems  have 
already  been  briefly  mentioned.  It  should  be  useful  now  to  consider 
the  sources  of  inaccuracy  common  in  mass  data  collection  systems; 
these should be considered in planning a survey. The following
discussion is perhaps more speculative than objective since there is
actually no available documentation of attempts to measure the extent
to which reported data (cost or activity oriented) differ from fact.
It is often acknowledged, however, by those "within the trade" that
inadequacies  do  exist.  So  what  follows  is  a  categorization  of  reasons 
why  such  problems  occur,  with  no  attempt  to  assess  the  importance  of 
any  particular  source. 

Errors  of  Observation.  These  are  errors  of  measurement  (misread 
gauges, faulty calculations, etc.). They arise from improper
training of the data-gatherer or inadequate instruments of measurement.
Compared to other errors, they are probably not too important in cost
analysis. 

Reporting  Errors.  These  are  errors  of  omission,  commission,  and 
willful  adjustment  of  observed  information.  They  may  arise  through 
misinterpretation  of  reporting  goals,  or  the  desire  to  make  things 
look different than they really are. Such manipulation is provoked,
for example, by the use of performance goals and activity levels as
criteria for promotions or manpower allocation. On the other hand,
reporting errors may be motivated by the simple wish to avoid
paperwork.

Errors  of  Classification  and  Aggregation.  A  classification  error 
occurs  when  some  resource  is  attributed  to  the  wrong  task,  or  category. 
Recent studies of replenishment spares consumption have shown, for
example, that numerous items are misclassified by maintenance shop
personnel because of carelessness or failure to use up-to-date technical
manuals.

Aggregation  error  results  when  expended  resources  are  totaled 
and reported at periodic intervals, rather than being attributed to
the time periods in which they were consumed. Aggregation error of
another sort occurs when data for several categories are lumped
together. Such data may be very appropriate for management purposes,
but the cost analyst must often arbitrarily prorate the information
among the categories of interest in order to accomplish his own ends.

Specious  Accuracy 

Data  may  be  accurate  in  the  sense  that  there  have  been  no  errors 
from  initial  observations  to  final  reporting,  yet  they  may  not  really 
represent  the  particular  behavior  that  one  supposes.  Such  misleading 
accuracy  is  said  to  be  specious. 

For example, Operation and Maintenance resources expended for
base support on Air Training Command (ATC) bases are normally
identified as "Training Support." Although such accounting may precisely
reflect support costs on those bases, the "training support" label
clouds the fact that the cost of support rendered to other major
command tenants is also included; these costs may be largely independent
of  the  training  function. 

As  another  example,  consider  maintenance  data  obtained  from  an 
independent sample that is designed to circumvent the problem of
inflationary (or deflationary) reporting. Such data may better measure
the actual maintenance needs of various equipment than does the
established reporting system. However, it might be a mistake to base
an estimating relationship (ER) on these data; the ER may estimate
maintenance needs, but may not reflect maintenance practice.
II. SOME SAMPLING CONCEPTS

TOWARD  REPRESENTATIVE  SAMPLES 

Section I tacitly recommends a basic distrust of data recorded by
anyone but the cost analyst who will use those data. The suggestion
has been for the analyst to determine at what point objectionable error
occurs in the data handling process and to go to that point and make
his own observations (or engage a well trained staff of observers).
When data are voluminous, customized collection implies the use of
sampling method. The task remains to discuss how to ensure that a sample
is  representative  of  the  total,  for  this  is  the  necessary  assumption  if 
decisions  are  to  be  based  on  sample  information. 

For  it  to  be  representative,  one  might  specify  that  the  sample 
reflect,  in  proper  proportion,  the  various  attributes  of  the  population 
under study. The sample need not be an exact miniature of the
population to be useful; the allowable latitude in this respect depends on
how sensitive the analyst's purposes are to errors in estimates. A
discussion of sample representation involves terms such as population,
distribution, bias, and error. These notions will be described, since
their meanings as used in sampling may differ from common use.

Populations  and  Their  Distributions 

Sampling is motivated by the desire to evaluate some characteristic
of interest in order to aid subsequent decisionmaking. In statistical
terminology, the population is the complete set of values of
that characteristic. Specification of the population requires
definition in terms of:

(1)  Content.  What  characteristic  of  the  population  is  under 
evaluation? 

(2)  Units.  What  are  the  units  into  which  the  population  can 
be  divided? 

(3)  Extent.  What  are  the  boundaries  of  the  population? 


(4) Time. What is the time interval during which information
is relevant, and what is the time interval for which an
inference is to be drawn?

If the problem is to estimate average fuel consumption of a
particular model aircraft in Fiscal Year 1967, the population is the
collection of fuel consumption rates for each such aircraft that was
operational during FY 1967. If, on the other hand, the problem is to
estimate (i.e., forecast) average fuel consumption of that aircraft
during FY 1968 - FY 1973, the population is the collection of fuel
consumption rates for each such aircraft operational during those five
years. Lacking clairvoyance, the procedure in the latter case would
be  to  substitute  a  related  population,  e.g.,  all  relevant  experience 
in  the  past  year,  and  assume  that  the  substitute  population  reflects 
the  target  population  closely  enough  for  practical  purposes. 

Populations  can  be  characterized  by  their  distributions.  Suppose 
it  is  possible  to  categorize  each  unit  in  a  population  according  to 
its  value,  and  then  prepare  a  graph  of  the  frequencies  with  which  each 
category  is  represented.  The  result  is  a  frequency  diagram  of  the 
distribution,  which  represents  a  visual  illustration  of  the  population. 
For  example,  if  the  fuel  consumption  rates  for  all  operational  aircraft 
are  allocated  into  20  gallon/hour  categories,  the  result  might  be 
graphed  as  follows: 


[Figure: frequency diagram. Horizontal axis: fuel consumption rates, gallons per hour; vertical axis: frequency of occurrence.]

The choice of category width is arbitrary, and is rather a matter of
visual taste; if the categories are made very small (e.g., 1 gallon/hour
intervals), the graph begins to assume the appearance of a smooth
curve.  For  simplicity,  all  subsequent  frequency  diagrams  in  this 
document  will  be  pictured  as  smooth  curves. 
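The categorization described above can be sketched in a few lines of Python. This is only an illustration: the fuel consumption rates are hypothetical values invented for the example, and the function name `frequency_table` and the constant `WIDTH` are assumptions, not anything defined in this Memorandum.

```python
from collections import Counter

WIDTH = 20  # category width in gallons per hour, as in the example above


def frequency_table(rates, width=WIDTH):
    """Count how many units fall into each category [lower, lower + width)."""
    counts = Counter((int(r) // width) * width for r in rates)
    return dict(sorted(counts.items()))


# Hypothetical fuel consumption rates (gallons per hour), one per aircraft.
rates = [412, 418, 431, 437, 439, 441, 444, 452, 458, 463, 479, 495]
table = frequency_table(rates)

# Crude text version of the frequency diagram: one mark per aircraft.
for lower, count in table.items():
    print(f"{lower:>4} to <{lower + WIDTH:<4} {'#' * count}")
```

Choosing a smaller `width` produces more, narrower categories, which is the sense in which the diagram approaches a smooth curve.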

Error  and  Bias 

A  sampling  procedure  is  usually  judged  by  the  accuracy  with  which 
it reflects the population, or with which it provides estimates of
population characteristics (such as the population average). This accuracy
is composed of two factors, sampling error and bias. Consider a sampling
procedure in which ten observations are taken from a population of
fifty, and their average value recorded. Suppose that this procedure
were repeated an infinite number of times.* There would result quite
a  number  of  sample  average  values,  but  they  would  tend  to  concentrate 
within  some  sharply  defined  region.  This  dispersion  of  sample  averages 
is  called  sampling  error.  Now,  it  is  conceivable  that  these  sample 
averages might, in turn, be averaged to produce a "grand sample
average," and that the latter may not coincide with the characteristic being
estimated (i.e., the average of the population taken as a whole). The
difference between the population average and the average of sample
averages is due to bias in the sampling procedure. Suppose this
example produced the results graphed below.


[Figure: frequency diagram of sample averages on a scale from 23 to 26, with the true mean of the population and the mean of sample averages marked.]


*That  is,  a  sample  of  ten  is  selected,  recorded,  and  replaced, 
then  another  sample  of  ten  is  selected,  recorded,  and  replaced,  etc. 
The population average is 25.2 and the average of sample averages is
24.5. Sampling error ranges within about 1.0 units, and the bias
associated with this sampling procedure is equal to 0.7 (i.e., 25.2 -
24.5 = 0.7). The combined effect of sampling error and bias may or may
not  preclude  the  usefulness  of  the  sampling  procedure,  depending  on 
the  accuracy  required  by  the  problem. 
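The repeated-sampling idea can be imitated on a computer. The sketch below is an illustration under stated assumptions rather than the procedure behind the figures quoted above: the population of fifty values is hypothetical, a finite number of repetitions stands in for "an infinite number of times," and the biased procedure (drawing only from the forty smallest units, as if the rest were missing from the list being sampled) is invented to show how bias arises. Bias is computed as population average minus average of sample averages, matching the arithmetic in the text.

```python
import random
import statistics


def repeat_sampling(population, frame, n=10, repetitions=2000, seed=1):
    """Draw many samples of size n from `frame`; return (sampling error, bias)."""
    rng = random.Random(seed)
    averages = [statistics.mean(rng.sample(frame, n)) for _ in range(repetitions)]
    sampling_error = statistics.pstdev(averages)  # dispersion of the sample averages
    bias = statistics.mean(population) - statistics.mean(averages)
    return sampling_error, bias


population = list(range(1, 51))  # hypothetical population of fifty values

# Procedure 1: every unit in the population can be drawn.
error_full, bias_full = repeat_sampling(population, frame=population)

# Procedure 2: the ten largest units can never be drawn (an incomplete list).
error_part, bias_part = repeat_sampling(population, frame=population[:40])

print(f"complete list:   sampling error {error_full:.2f}, bias {bias_full:+.2f}")
print(f"incomplete list: sampling error {error_part:.2f}, bias {bias_part:+.2f}")
```

Under these assumptions the second procedure shows the smaller dispersion of the two but is persistently off target, which is the distinction between sampling error and bias drawn above.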

A helpful analogy is to consider the marksmanship of three
riflemen, where the riflemen are attempting to "estimate" the center of the
bullseye: 


The  target  on  the  left  was  turned  in  by  marksman  A.  He  has  a  very 
steady  arm,  but  apparently  suffers  from  astigmatism;  although  his  aim 
is  precise  (i.e.,  small  sampling  error),  he  consistently  misses  his 
mark.  Marksman  B  has  no  bias  in  his  score,  but  his  precision  is  quite 
a bit less than A. Marksman C displays a small bias and more
precision than A. Since marksman C's particular mix of precision and bias
tends  to  consistently  put  him  nearer  to  the  center  of  the  target,  we 
would  probably  consider  him  the  most  accurate  of  the  three. 

In  light  of  the  previous  discussion  on  errors  in  cost  data,  we 
might  say  that  marksman  A  could  represent  estimates  resulting  from  the 
use of data coming out of any existing reporting system; the estimates
are consistent, but their bias tends to invalidate their usefulness.
Marksmen B and C might represent two alternative sampling schemes.
Scheme B is virtually free of bias but is burdened by large sampling
error. Scheme C displays some bias but has the saving grace of small
sampling error. Scheme C would probably be preferred (provided the
magnitude of the bias could be assessed).

Illustration adapted from Jessen.

Probability  Samples 

There  are  two  broad  approaches  to  a  representative  sample:  (1) 
judgment  sampling  and  (2)  probability  sampling.  In  judgment  sampling, 
the  analyst  relies  on  his  experience  and  skill  to  select  a  number  of 
sample  points  that  are  "typical"  of  the  total  population  under  study. 
The  judgment  sample  is  characterized  by  the  following  comments: 

(1)  Accuracy  may  vary  from  sampler  to  sampler,  but  (for  a  given 
sampler)  is  fairly  uniform  as  sample  size  is  varied. 

(2)  There  is  generally  some  bias  present. 

(3)  There  is  no  objective  measure  of  the  combined  effects  of 
sampling  error  or  bias. 

Probability samples are drawn with the aid of a table of random numbers
or any other device that assures that each sample point selection is
independent of all others. The general characteristics of probability
sampling  are: 

(1)  Accuracy  is  not  dependent  on  who  is  doing  the  sampling,  but 
it  is  dependent  on  sample  size. 

(2)  There  is  no  sampling  bias. 

(3)  Sampling  error  can  be  estimated  objectively. 

Sampling  error  can  usually  be  estimated  from  a  single  sample,  but 
very  rarely  is  it  possible  to  estimate  bias.  A  highly  experienced 
sampler  who  is  intimately  familiar  with  the  subject  under  analysis  may 
be able to satisfactorily convince himself that the bias in his
procedure is "reasonable." But the researcher's audience is typically a
skeptical one and is inclined to have less faith in his judgment. The
presence of bias muddles up any objective statement of accuracy. For
this reason, it is usually easier to accept a lot of sampling error
rather  than  a  little  bias.  This  also  motivates  this  paper's  near  total 
emphasis  on  probability  sampling. 
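As a hedged illustration of the two properties claimed for probability samples, the sketch below draws a simple random sample with a pseudo-random number generator standing in for a table of random numbers, and then estimates the sampling error of the sample mean from that single sample. Everything named here is an assumption made for the example: the function names, the hypothetical units, the use of an n - 1 divisor for the sample variance, and the standard textbook finite population correction (1 - n/N), none of which is stated in this section.

```python
import math
import random
import statistics


def simple_random_sample(units, n, seed=None):
    """Select n distinct units, every unit having the same chance of selection."""
    rng = random.Random(seed)
    return rng.sample(units, n)


def estimated_standard_error(sample, population_size):
    """Estimate the standard error of the sample mean from one sample.

    Assumes simple random sampling without replacement, so the finite
    population correction (1 - n/N) is applied.
    """
    n = len(sample)
    s2 = statistics.variance(sample)  # sample variance with n - 1 divisor
    return math.sqrt(s2 / n * (1 - n / population_size))


units = list(range(1, 51))  # hypothetical population of fifty values
sample = simple_random_sample(units, 10, seed=7)
print("sample mean:", statistics.mean(sample))
print("estimated standard error:", round(estimated_standard_error(sample, len(units)), 2))
```

Nothing in the selection step depends on the sampler's judgment, and the error estimate uses only the sample and the known population size.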




STATISTICAL  BASIS  FOR  INFERENCE 

The  next  several  pages  review  some  basic  statistical  principles 
as they relate to sampling and develop the line of reasoning that
supports the sampling method as a basis for decisionmaking. For those
who are already convinced of the credibility of probability sampling,
this discussion will hold little interest.

In  the  remainder  of  this  Memorandum,  population  parameters  are 
denoted  by  Greek  letters  and  sample  statistics*  are  denoted  by  Roman 
letters: 

            Population                    Sample

mean        $\mu$ (mu)                    $\bar{X}$

variance    $\sigma^2$ (sigma squared)    $S^2$

The  size  of  the  population  is  represented  by  N,  and  sample  size  is  n. 

An unbiased estimate of a parameter is indicated by placing a "hat,"
^, over the parameter symbol. Thus, to say that a sample mean is an
unbiased estimate of the population mean is equivalent to the expression:

$$\hat{\mu} = \bar{X}$$

Descriptors  of  Populations  and  Samples 

Recall  the  earlier  discussion  of  population  distributions.  There 
are generally two characteristics of any population distribution that
interest  the  analyst:  central  tendency  and  dispersion. 

Two measures of central tendency are the median and the mean. If
all units (denoted $X_i$) in the population are arrayed in order of size,
the median is the value of the middle unit. The mean is the average
of the population units:

$$\mu = \frac{\sum_{i=1}^{N} X_i}{N}$$


*Parameters are constants associated with the population; statistics
are numbers calculated from the sample, and therefore are variable
from sample to sample.
Of the two parameters, the mean is most often of interest, especially
with near-symmetric distributions. The median, on the other hand, is
independent of the distribution, and is often, therefore, the preferred
parameter in situations where the shape of the distribution is
irregular.

The most common measure of dispersion is the variance, defined as the average of squared deviations of units from the mean:

$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$$

The standard error (or standard deviation) is the square root of the variance:

$$\sigma = \sqrt{\sigma^2}$$

An alternative measure of dispersion is the mean deviation:

$$\frac{\sum_{i=1}^{N} |X_i - \mu|}{N}$$

The  mean  deviation  is  seldom  used;  the  standard  deviation  is  more  popu¬ 
lar  because  of  its  relationship  to  confidence  intervals,  to  be  discussed 
later. 

A sample is some portion of the population composed of n units. Analogous to the two population parameters, $\mu$ and $\sigma^2$, are the sample mean and the sample variance:

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$$

$$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n}$$

These are called statistics because they are variables dependent on the particular assortment of n units chosen for the sample.
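The descriptors defined above can be illustrated with a short sketch. The eight-unit population and the four-unit sample below are hypothetical values chosen only for easy arithmetic; the divisor in the sample variance is n, as in the definition of $S^2$ used in this Memorandum.

```python
from statistics import median


def mean(values):
    """Average of the units."""
    return sum(values) / len(values)


def variance(values):
    """Average squared deviation from the mean (divisor = number of units)."""
    m = mean(values)
    return sum((x - m) ** 2 for x in values) / len(values)


def mean_deviation(values):
    """Average absolute deviation from the mean."""
    m = mean(values)
    return sum(abs(x - m) for x in values) / len(values)


# Hypothetical population of N = 8 units (illustrative values only).
population = [2, 4, 4, 4, 5, 5, 7, 9]
mu = mean(population)                  # population mean
sigma2 = variance(population)          # population variance
sigma = sigma2 ** 0.5                  # standard deviation
pop_median = median(population)        # middle value of the ordered units
pop_mean_dev = mean_deviation(population)

# Hypothetical sample of n = 4 units taken from that population.
sample = [2, 4, 5, 9]
x_bar = mean(sample)                   # sample mean
s2 = variance(sample)                  # sample variance, divisor n

print(mu, sigma2, sigma, pop_median, pop_mean_dev)
print(x_bar, s2)
```

A different choice of four units would give different values of the sample mean and sample variance, which is the sense in which these are statistics rather than parameters.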

Sampling Distributions

Being  a  variable,  a  statistic  also  has  a  distribution.  This  dis¬ 
tribution  is  called  the  sampling  distribution,  since  it  reflects  the 
frequencies  with  which  the  statistic  would  take  on  different  values  if 
the  sampling  procedure  were  repeated  an  infinite  number  of  times.  The 
expected  value  of  a  statistic  is  defined  as  the  mean  of  its  sampling 
distribution. For example, the following might be the sampling distribution of $S^2$ [the expected value of $S^2$ is denoted as $E(S^2)$]:


Since  the  purpose  of  sampling  is  to  obtain  information  about  the 
population,  we  are  generally  concerned  that  our  sample  statistics  are 
accurate  estimates  of  the  corresponding  population  parameters.  The 
two  aspects  of  accuracy,  precision  and  bias,  can  now  be  characterized 
in terms of the sampling distribution.

An  estimator  (i.e.,  the  formulas  actually  used  in  deriving  esti¬ 
mates)  is  unbiased  if  the  expected  value  of  the  statistic  is  equal  to 
the  parameter  it  estimates. 

An  estimator  is  precise  if  it  has  a  relatively  narrow  sampling 
distribution  (i.e.,  if  the  sampling  error  is  small). 

The diagram below represents the sampling distribution of $\bar{X}$ for
two  different  sampling  procedures  superimposed  upon  the  distribution 
of  the  parent  population  (dashed  lines): 


Since the distribution is conceptually derived from an infinite number of iterations, its graph is drawn in terms of relative frequency. For geometric interpretation, the probability (P) that $S^2$ will assume a value within some interval is equal to the percentage area under the curve that is bounded by that interval (e.g., $P[14.0 < S^2 < 15.0] = .10$).
Estimator B is unbiased (its expected value is equal to $\mu$) but not very precise. Estimator A is more precise but is biased:

$$\text{Bias}(A) = 65 - 60 = 5$$


From  the  diagram,  it  appears  that  a  precise,  slightly  biased 
estimator  might  be  preferable  to  an  unbiased,  less  precise  estimator. 
This tradeoff is difficult to evaluate since $\mu$ is unknown. The usual practice is to follow procedures that are known to produce unbiased estimators, then select the estimator that has the greatest precision. Most of the unbiased procedures require probability sampling.

The requisites for probability sampling are:

(1) Every unit of the population has a known probability of being included in the sample.

(2) The sample is drawn by some method of random selection (each selection is independently determined).

(3)  Probabilities  of  selection  are  taken  into  account  when  making 
estimates  from  the  sample. 

Probability sampling methods provide unbiased estimates of population parameters, or contain certain bias that can be evaluated. For example, $\bar{X}$ is always an unbiased estimate of $\mu$ if probability sampling has been employed; the sample variance, $S^2$, is a biased estimator of the population variance, $\sigma^2$, but the bias is corrected by a simple adjustment factor, $\frac{n}{n-1}$. Non-probability methods, such as judgment sampling, may provide more precise estimates, but it is usually impossible to identify bias. Probability sampling also furnishes information on the sampling distributions of the estimators, and thus provides the bridge necessary to be able to draw inferences about the population, based on the sample.
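The bias of $S^2$ and its correction can be checked by simulation. The following is a minimal sketch under stated assumptions: the population of 1,000 uniformly distributed values, the sample size of 5, and the number of repetitions are all hypothetical choices, and a finite number of repetitions only approximates the expected value.

```python
import random


def sample_variance(sample):
    """S^2 as defined in the text: squared deviations averaged over n."""
    n = len(sample)
    x_bar = sum(sample) / n
    return sum((x - x_bar) ** 2 for x in sample) / n


rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical population of N = 1000 units.
population = [rng.uniform(0, 100) for _ in range(1000)]
N = len(population)
mu = sum(population) / N
sigma2 = sum((x - mu) ** 2 for x in population) / N

n = 5           # sample size
trials = 20000  # stands in for "an infinite number of times"

# Approximate E(S^2) by averaging S^2 over many simple random samples
# drawn without replacement.
mean_s2 = sum(
    sample_variance(rng.sample(population, n)) for _ in range(trials)
) / trials

corrected = mean_s2 * n / (n - 1)  # apply the n/(n-1) adjustment

print("sigma^2              :", round(sigma2, 1))
print("average S^2          :", round(mean_s2, 1))
print("average n/(n-1) * S^2:", round(corrected, 1))
```

The uncorrected average should fall noticeably short of $\sigma^2$ (by roughly the factor (n-1)/n), while the corrected average should land close to it.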

Confidence  Intervals 

So far it has been shown that samples can provide useful estimates of population parameters; it is known that if the sampling procedure is repeated an infinite number of times, the average values of $\bar{X}$ and $\left(\frac{n}{n-1}\right) S^2$ will be $\mu$ and $\sigma^2$. A question remains, however, about the inferences drawn from a single sample: how close is $\bar{X}$ to $\mu$? This cannot be determined with certainty, but thanks to a very helpful characteristic of nature which is expressed as the central limit theorem, it is possible to specify the shape of the sampling distribution of $\bar{X}$ and thereby find the probability that the quantity $|\bar{X} - \mu|$ is within some specified tolerance level. The central limit theorem provides that, as sample size increases, sample means tend to be distributed normally regardless of how the parent population is distributed. The distribution of sample means has the same mean as the parent population, but its standard error is equal to $\sigma / \sqrt{n}$.
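The behavior described by the central limit theorem can be observed in a small simulation. This is an illustrative sketch only: the skewed (exponential) population, the sample size of 30, and the number of repetitions are assumptions made for the example, not values from the text.

```python
import math
import random

rng = random.Random(1)  # fixed seed so the run is repeatable

# Hypothetical, strongly skewed population of N = 5000 units.
population = [rng.expovariate(1 / 60) for _ in range(5000)]
N = len(population)
mu = sum(population) / N
sigma = math.sqrt(sum((x - mu) ** 2 for x in population) / N)

n = 30         # sample size
trials = 5000  # number of repeated samples

# Build an approximate sampling distribution of the sample mean.
means = []
for _ in range(trials):
    sample = rng.sample(population, n)  # drawn without replacement
    means.append(sum(sample) / n)

mean_of_means = sum(means) / trials
sd_of_means = math.sqrt(
    sum((m - mean_of_means) ** 2 for m in means) / trials
)
standard_error = sigma / math.sqrt(n)

# Share of sample means within one and two standard errors of mu.
within_one = sum(abs(m - mu) <= standard_error for m in means) / trials
within_two = sum(abs(m - mu) <= 2 * standard_error for m in means) / trials

print("mu, mean of sample means   :", round(mu, 2), round(mean_of_means, 2))
print("sigma/sqrt(n), sd of means :", round(standard_error, 2), round(sd_of_means, 2))
print("within 1 and 2 std. errors :", within_one, within_two)
```

Even though the parent population is far from normal, the two shares should come out near the normal-curve values of about .68 and .95.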

Pictured below are the population distribution (dashed line) and the sampling distribution of $\bar{X}$ from samples of size n=10:
Since the distribution of $\bar{X}$ is approximately normal, theory tells us that about 68 percent of the area beneath the curve lies within one standard error ($\sigma_{\bar{X}}$) of $\mu$, and 95 percent of the area lies within two standard errors. Another interpretation is that the probability that $\bar{X}$ will fall within one standard error of $\mu$ is .68. Standardized normal curve tables are available in most texts that provide this information for fractional multiples of $\sigma_{\bar{X}}$.

Suppose that a sample is drawn from the above population and the two statistics computed:

$$\bar{X} = 43, \qquad S^2 = 25.$$

Neither $\mu$ nor $\sigma_{\bar{X}}$ is known, but $\sigma_{\bar{X}}^2$ can be estimated:

$$\hat{\sigma}_{\bar{X}}^2 = \frac{\hat{\sigma}^2}{n} = \left(\frac{n}{n-1}\right) \frac{S^2}{n} = \frac{S^2}{n-1} = \frac{25}{9}$$

A slightly biased estimate of $\sigma_{\bar{X}}$ is found by taking the square root of $\hat{\sigma}_{\bar{X}}^2$, which here is 5/3. From the previous discussion it is known that if the sample were drawn repeatedly, 95 percent of the sample means would fall within $2\sigma_{\bar{X}}$ of the population mean. This statement is equivalent to saying that $\mu$ is within $2\sigma_{\bar{X}}$ of the sample mean 95 percent of the time. Thus it is said that the 95 percent confidence interval for $\mu$ is $43 \pm 10/3$. This does not mean that the probability that $\mu$ lies in this interval is .95; however, if one were to follow this procedure for setting confidence intervals in sample after sample, he would expect his intervals to contain $\mu$ 95 percent of the time.
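The arithmetic of this example can be traced in a few lines. The function name and its arguments are illustrative; the multiplier of 2 standard errors is the rounded value used in the text for 95 percent confidence.

```python
import math


def confidence_interval(x_bar, s2, n, multiplier=2.0):
    """Interval x_bar +/- multiplier * (estimated standard error of the mean).

    s2 is the sample variance with divisor n, so the n/(n-1) adjustment
    is applied before dividing by n.
    """
    var_of_mean = (n / (n - 1)) * s2 / n   # equals s2 / (n - 1)
    half_width = multiplier * math.sqrt(var_of_mean)
    return x_bar - half_width, x_bar + half_width


# Values from the example: sample mean 43, S^2 = 25, n = 10.
low, high = confidence_interval(43, 25, 10)
print(round(low, 2), round(high, 2))
```

The half-width works out to 2 x 5/3 = 10/3, so the interval runs from about 39.67 to about 46.33.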

The only flaw in the procedure for arriving at confidence intervals has been the use of $S^2$ to estimate $\sigma^2$. Fortunately, this only causes problems with small samples, which are discussed later under "Confidence Intervals from Small Samples."

The concept of sampling error may be a little ponderous for the decisionmaker to employ. If the decisionmaker desires a certain maximum tolerance in order to use the sample results, the sampler can estimate the odds that the tolerance will be met (absolute assurance of a given tolerance is for most purposes impossible). Whether the odds are acceptable depends on the decisionmaker's aversion to the risk of incorrect conclusions.

As  an  example,  suppose  an  on-site  sample  survey  has  been  made  of 
the  number  of  direct  depot  man  hours  required  to  repair  and  refurbish 
a  certain  missile  guidance  system  component.  The  sample  yields  a 
statement  that  on  the  average  1191  direct  man  hours  are  required  for 
each  of  the  components  of  interest.  From  information  concerning  the 
sample, an analyst can estimate the odds of achieving a specified tolerance. This might be stated as "the population mean is equal to 1191 direct man hours ± 12.5 hours (tolerance) at .93 confidence." If the
analyst  is  willing  to  act  upon  estimates  with  this  tolerance  and  con¬ 
fidence,  the  survey  has  provided  useful  information.  A  more  conserva¬ 
tive  analyst  might  feel  comfortable  only  when  this  tolerance  is  achieved 
with  .99  confidence,  in  which  case  the  sample  procedure  would  need  to 
be  revamped  to  obtain  better  representation  of  the  population. 
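The odds of meeting a stated tolerance follow from the normal curve once the standard error of the mean has been estimated. The sketch below assumes a standard error of 6.9 man hours; that figure does not appear in the text and is chosen only because it reproduces a confidence of about .93 for a tolerance of 12.5 hours.

```python
from statistics import NormalDist


def confidence_for_tolerance(tolerance, standard_error):
    """Probability that the sample mean lies within +/- tolerance of the
    population mean, assuming a normal sampling distribution."""
    z = tolerance / standard_error
    normal = NormalDist()  # standard normal: mean 0, standard deviation 1
    return normal.cdf(z) - normal.cdf(-z)


# Hypothetical standard error of 6.9 man hours (an assumption, not from the text).
confidence = confidence_for_tolerance(12.5, 6.9)
print(round(confidence, 3))
```

Halving the standard error (for example, by a better sample design or a larger sample) would raise the confidence attached to the same tolerance well above .99.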

SIMPLE  RANDOM  SAMPLING 

The  simplest  probability  sample  is  the  simple  random  sample.  The 
required  conditions  are: 

(1) Independent selection of sample units.

(2)  Equal  probability  of  selection  for  all  units  in  the  popula¬ 
tion. 

The  first  condition  specifies  that  the  inclusion  of  a  particular  unit 
in  the  sample  is  in  no  way  dependent  on  the  inclusion  of  some  other 
unit;  this  is  accomplished  by  randomizing  the  selection.  The  second 
condition  assures  that  the  sample  will  not  be  biased. 

Both conditions are implemented by proper selection of a sampling frame. A frame is a list; a way of dividing the population into sampling units that are distinct and non-overlapping and that together constitute the whole of the population. A suitable frame allows the listing or numbering of all units in order to make a random selection (although for some sampling procedures to be discussed later the complete list is not necessary).
A table of random numbers is one way to draw the sample. Suppose, for example, that a sample of size n = 10 is desired from a population of N = 452. Choose some arbitrary point in a table of random numbers and read down the column of 3-digit numbers, picking out the first ten numbers that do not exceed 452. The sample consists of those sampling units that correspond to the chosen numbers (any number appearing more than once should be ignored after the first time).*
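The table-reading procedure can be imitated with a pseudo-random number generator standing in for the printed table. This is a sketch under that assumption; the function name and the seed are hypothetical, and the number 000 is skipped because units are numbered from 1.

```python
import random


def draw_simple_random_sample(N, n, rng):
    """Read 'table entries' until n distinct unit numbers in 1..N are found."""
    digits = len(str(N))          # 3-digit entries when N = 452
    largest_entry = 10 ** digits - 1
    chosen = []
    while len(chosen) < n:
        entry = rng.randint(0, largest_entry)  # next entry in the "column"
        if entry < 1 or entry > N:
            continue              # entry does not correspond to any unit
        if entry in chosen:
            continue              # ignore repeats after the first time
        chosen.append(entry)
    return chosen


rng = random.Random(1968)  # arbitrary starting point in the "table"
sample_units = draw_simple_random_sample(452, 10, rng)
print(sample_units)
```

Each of the 452 units has the same chance of selection, and no unit can appear twice, which matches the non-replacement sampling assumed in this Memorandum.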


Choice  of  Sample  Size 

The  choice  of  sample  size  involves  a  tradeoff  between  cost  and 
precision;  increased  precision  requires  a  larger  sample  size,  which 
in  turn  implies  higher  cost.  For  the  analyst  who  does  not  have  a 
fixed  budget,  it  is  probably  more  meaningful  to  translate  sampling 
cost  to  sampling  time  (assuming  the  preferred  path  to  a  solution  is 
the  shortest  path);  cost  and  time  can  be  considered  synonymous.  The 
typical  procedure  for  determining  sample  size  is  to  specify  some  level 
of  precision,  solve  for  sample  size  required  for  several  alternative 
sampling schemes, then compare costs (and possibly adjust the precision requirement if costs for all alternatives are out of line with the budget). The following steps assume simple random sampling; the
rationale  is  the  same  for  other  sampling  schemes,  but  the  computation 
is  more  complex. 

*There is nothing essential about the use of random number tables, for more simple devices such as tossed dice or numbered chips drawn from a hat will often do. Sometimes it may be assumed that the population units occur randomly in the sampling frame, so that any arbitrary selection is valid; for example, if one is sampling 40 airmen to estimate the average skill level of airmen at a particular base, the first 40 airmen listed in the base directory can probably be regarded as a random sample (since skill level is not related to surname). Care should be exercised that such devices actually do assure independent and equal probability of selection. The advantage of a random number table is that such assurances are scientifically provided.

The first step is to decide how large an error can be tolerated in the estimate. This requires careful thinking about the use to be made of the estimate and about the consequences of sizable error (is the estimate to be very precise or just a rough estimate?). The figure arrived at may be, to some extent, arbitrary, but this is the necessary step that patterns the sample estimate to the objective of the analysis. The second step is to express the allowable error in terms of confidence limits. Suppose L is the allowable tolerance in the sample mean, and we are willing to take a 5 percent chance that the error will exceed L (we want to be "reasonably certain" that the error will not exceed L). The 95 percent confidence limits computed from a single mean are:


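$$\bar{X} \pm 2\sigma_{\bar{X}} = \bar{X} \pm \frac{2\sigma}{\sqrt{n}}$$

Since the tolerance is L:

$$L = \frac{2\sigma}{\sqrt{n}}, \qquad n = \frac{4\sigma^2}{L^2}$$

The general formula is:

$$n = \frac{z^2 \sigma^2}{L^2}$$

where z is the standard normal deviate, i.e., the multiple of $\sigma_{\bar{X}}$ that corresponds to the desired confidence interval.

[Figure: normal curve with the central area marked "Desired %".]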


The  appropriate  z-values  can  be  found  in  tables  of  standard  nor¬ 
mal  deviates  in  most  statistics  texts. 
In order to use this formula, an estimate of $\sigma$ is necessary. This may be accomplished by a small preliminary sample, or by examining previous samplings of similar populations. For populations of size greater than 500, a crude estimate of $\sigma$ is (range)/6, the range being defined as the difference between the highest and lowest values in the population.

Having calculated the sample size required for the stated precision, the third step is to evaluate the sample cost. If the cost is high, it may be necessary to relax the precision requirement. It may even appear preferable to give up the sampling plan altogether in favor of a complete census.
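The sample size calculation lends itself to a short sketch. The function names are illustrative, and the input values (a range of 60 units and a tolerance of 2 units) are hypothetical; the finite population correction is ignored here.

```python
import math
from statistics import NormalDist


def sample_size_from_z(z, sigma, tolerance):
    """n = z^2 * sigma^2 / L^2, rounded up to a whole sampling unit."""
    return math.ceil((z * sigma / tolerance) ** 2)


def required_sample_size(sigma, tolerance, confidence=0.95):
    """Same formula, with z looked up from the standard normal curve."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # two-sided deviate
    return sample_size_from_z(z, sigma, tolerance)


# Crude estimate of sigma from a hypothetical population range of 60 units.
sigma_guess = 60 / 6

n_rounded_z = sample_size_from_z(2, sigma_guess, tolerance=2)   # z taken as 2
n_exact_z = required_sample_size(sigma_guess, tolerance=2)      # z about 1.96
n_tighter = required_sample_size(sigma_guess, tolerance=1)      # half the tolerance

print(n_rounded_z, n_exact_z, n_tighter)
```

Because the tolerance enters the formula squared, halving the allowable error roughly quadruples the required sample size, which is why the cost evaluation in the third step often leads back to a relaxed precision requirement.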

Sampling  for  Attributes 

Population  characteristics  can  be  classified  as  quantitative  or 
qualitative. Quantitative characteristics (e.g., annual income) are called variates and are expressed numerically. Qualitative characteristics (e.g., sex) are called attributes and are non-numerical. Sampling of variates leads to the estimation of totals and averages; sampling of attributes leads to the estimation of proportions, or percentages. The various sampling designs generally apply in both cases, the main difference being the form of the estimators (i.e., the formulas used in deriving estimates). There has been no attempt in this survey to grant "equal time" to attributes, since the discussion and examples would simply parallel that of variates.

Consider  a  study  to  determine  the  proportion  of  overseas  Air  Force 
installations  that  maintain  their  own  telephone  switchboard  facilities; 
each  base  selected  for  the  sample  would  be  classified  as  either  (A) 
maintaining  its  own  facility  or  (B)  contracting  that  function  out.  The 
frequency distribution has the following form:

[Figure: frequency distribution of installations over the two classes, A and B.]

If A and B were each assigned a numerical value, this distribution could be handled the same as the variate case. The analyst is usually interested in determining the proportion of units exhibiting property A:

$$\Pi_A = \frac{N_A}{N}$$

This number is the same as $\mu$ if every sampling unit exhibiting characteristic A is given a value of one (1), and all other sampling units are valued at zero:

$$\Pi_A = \frac{N_A}{N} = \frac{\sum_{i=1}^{N} X_i}{N} = \mu_A$$

Assuming simple random sampling, the sample proportion, p, is an unbiased estimator of the population proportion, $\Pi$. The variance of the sampling distribution of p takes the following form:

$$\sigma_p^2 = \frac{\Pi(1-\Pi)}{n}$$

Its unbiased estimator is

$$S_p^2 = \frac{p(1-p)}{n-1}$$

Sometimes the intent is to estimate $N_A$, the total units in the population having the desired attribute. The appropriate estimators here are:

$$\hat{N}_A = pN, \qquad S_{pN}^2 = N^2 S_p^2$$
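A minimal sketch of the attribute estimators follows. The counts are hypothetical (12 of 40 sampled installations maintaining their own switchboards, out of an assumed 200 installations), the finite population correction is ignored, and the divisor n - 1 in the variance estimate is the choice that makes it unbiased under simple random sampling; treat that divisor as an assumption of the sketch.

```python
import math


def proportion_estimates(count_a, n, N):
    """Estimate the proportion and the population total for attribute A."""
    p = count_a / n                      # sample proportion
    var_p = p * (1 - p) / (n - 1)        # estimated variance of p (assumed n - 1 divisor)
    total = p * N                        # estimated number of units with attribute A
    se_total = N * math.sqrt(var_p)      # standard error of the estimated total
    return p, var_p, total, se_total


# Hypothetical survey: 12 of n = 40 sampled bases have attribute A, N = 200.
p, var_p, total, se_total = proportion_estimates(12, 40, 200)
print(round(p, 3), round(math.sqrt(var_p), 4), round(total, 1), round(se_total, 1))
```

With these figures the estimated proportion is .30 and the estimated total is 60 installations, with a standard error of roughly 15 installations.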
The sampling distribution of p (and of pN) has the desirable central limit theorem property of tending toward normal as the sample size increases. However, if either $\Pi$ or $(1-\Pi)$ is very small, very large sample sizes may be required. This is because the sampling distribution tends to be non-symmetrical for values of $\Pi$ that are very high or very low:


As a rule of thumb, the following conditions should hold before relying on normal distribution properties:

$$np > 5 \quad \text{and} \quad nq > 5, \qquad \text{where } q = 1 - p.$$

For values of p that are very large or very small, it is much cheaper in terms of sample size to base confidence intervals on the properties of the Poisson distribution (a general class of skewed distributions) or the binomial distribution. Reference to these two distributions can be found in most basic statistics texts (e.g., Hoel).

Finite  Population  Correction 

This paper assumes non-replacement sampling throughout. This is the general class of samples in which individual population units are not allowed to appear in the sample more than once; i.e., there is no duplication in the selecting of random numbers. When sampling without replacement from finite populations, it is necessary to introduce the factor $\left(1 - \frac{n}{N}\right)$ into the computation of sampling variance. Hence:

$$\sigma_{\bar{X}}^2 = \left(1 - \frac{n}{N}\right) \frac{\sigma^2}{n}$$

This factor is called the finite population correction (fpc), and assures that the estimated sampling variance tends to zero as the sample size approaches the population size N. In practice, the fpc can be ignored when the sampling fraction is not greater than 5 or 10 percent. The effect of ignoring the correction is to overestimate the standard error, which generally is not as serious as underestimation.
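The size of the overestimate that results from ignoring the fpc can be tabulated directly. The sketch below uses illustrative function names and a hypothetical population of 1,000 units.

```python
import math


def fpc(n, N):
    """Finite population correction, 1 - n/N."""
    return 1 - n / N


def overstatement_if_ignored(n, N):
    """Ratio of the uncorrected standard error to the corrected one."""
    return 1 / math.sqrt(fpc(n, N))


# Hypothetical population of N = 1000 units at several sampling fractions.
N = 1000
for n in (50, 100, 500):
    ratio = overstatement_if_ignored(n, N)
    print(f"sampling fraction {n / N:.0%}: standard error overstated by "
          f"{(ratio - 1):.1%}")
```

At sampling fractions of 5 and 10 percent the standard error is overstated by only about 3 and 5 percent, which is the basis for the rule of thumb; at a 50 percent sampling fraction the overstatement exceeds 40 percent and the correction should not be dropped.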

Dissecting  the  Sampling  Variance  Estimator 

It may be of interest to summarize the "anatomy" of the sampling variance. The various components are:

$\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n}$
    The average of squared deviations of sample observations from the sample mean; simply a convenient descriptive measure of variability within the sample, but which is also useful because of its relationship to $\sigma^2$.

$\frac{n}{n-1}$
    The factor necessary to convert the measure of sample variability into an unbiased estimate of population variability as measured by $\sigma^2$.

$\frac{1}{n}$
    The factor that converts the measure of population variability into a measure of variability of the sampling distribution.

$1 - f$
    The factor that makes allowance for sampling from finite populations ($f = n/N$).
Assembling the components gives:

$$S_{\bar{X}}^2 = (1-f) \left(\frac{1}{n}\right) \left(\frac{n}{n-1}\right) \left(\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n}\right)$$

which is the unbiased estimator of

$$\sigma_{\bar{X}}^2 = (1-f) \frac{\sigma^2}{n}$$


Examination of this formula draws attention to the fact that the sampling error ($\sigma_{\bar{X}}^2$) depends primarily on the population variance ($\sigma^2$) and the absolute sample size (n). The relative sample size (i.e., the fraction of the population sampled) is not an important factor in large populations. For example, 50 observations from a population of 20,000 will give an estimate about as precise as 50 observations from a population of 1,000, provided that the population variances are the same.
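The assembled estimator, and the comparison of the two population sizes, can be sketched as follows. The four sample observations, the population size of 100 used with them, and the population variance of 100 used in the comparison are hypothetical values.

```python
def estimated_variance_of_mean(sample, N):
    """Multiply the four components listed above, one factor at a time."""
    n = len(sample)
    x_bar = sum(sample) / n
    avg_sq_dev = sum((x - x_bar) ** 2 for x in sample) / n  # S^2, divisor n
    unbias = n / (n - 1)       # converts S^2 to an unbiased estimate of sigma^2
    to_sampling_dist = 1 / n   # population variability -> variability of the mean
    fpc = 1 - n / N            # allowance for a finite population
    return fpc * to_sampling_dist * unbias * avg_sq_dev


def true_variance_of_mean(sigma2, n, N):
    """Sampling variance of the mean when sigma^2 is known."""
    return (1 - n / N) * sigma2 / n


# Hypothetical sample of n = 4 from a population of N = 100.
estimate = estimated_variance_of_mean([2, 4, 5, 9], N=100)

# The text's comparison: 50 observations from 20,000 versus from 1,000,
# with the same (hypothetical) population variance of 100.
var_large = true_variance_of_mean(100, 50, 20000)
var_small = true_variance_of_mean(100, 50, 1000)

print(round(estimate, 3), round(var_large, 3), round(var_small, 3))
```

The two sampling variances in the comparison differ by only about 5 percent, even though one population is twenty times the size of the other.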

Confidence  Intervals  from  Small  Samples 

One problem in the determination of confidence intervals arises from the use of the following formula for determining the upper and lower limits:

$$\bar{X} \pm z\,\sigma_{\bar{X}}$$

The variable z is the standard normal deviate that corresponds to the degree of confidence desired. This formulation rests on the fact that

$$z = \frac{\bar{X} - \mu}{\sigma_{\bar{X}}}$$

has a standard normal distribution (i.e., $\mu = 0$ and $\sigma = 1$). Since $\sigma$ is not usually known, it is often necessary to use its sample estimator instead. The expression

$$t = \frac{\bar{X} - \mu}{S_{\bar{X}}}$$

follows what is called the t-distribution. The t-distribution is very close to normal, but has wider dispersion when the sample size is small. For this reason it is preferable to use t-values instead of z-values when sample sizes are less than 30; upper and lower limits are then determined from the expression:

$$\bar{X} \pm t\,S_{\bar{X}}$$

where t and S have been substituted for z and $\sigma$, respectively. Tables are available from which t-values can be ascertained in much the same manner as the z-values, except that the sample size must be specified.

A portion of a t-table appearing in R. A. Fisher's 1934 volume of Statistical Methods for Research Workers is reproduced below. If, for example, the desired confidence level were .70 and the sample size were n=15, the appropriate t-value would be t = 1.076. The column headings refer to the area under the "tail" of the curve (e.g., a .70 confidence level implies .3 significance). The row headings refer to degrees of freedom, a rather abstruse statistical concept which for simple random sampling is one less than the sample size.

degrees of freedom              level of significance
     (n - 1)            .5       .3       .1       .05      .01

       13              .694    1.079    1.771    2.160    3.012
       14              .692    1.076    1.761    2.145    2.977
       15              .691    1.074    1.753    2.131    2.947
       16              .690    1.071    1.746    2.120    2.921
       17              .689    1.069    1.740    2.110    2.898

This boundary between "small" and "large" samples is arbitrary; experience has shown that for most purposes, the z-distribution sufficiently approximates the t-distribution when sample size exceeds 25 to 30.
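A small-sample interval can be computed by looking up t in the excerpt of the table and applying it to the estimated standard error of the mean. In the sketch below the sample mean of 43 and sample variance of 25 are hypothetical inputs, the finite population correction is ignored, and the significance level is passed directly as it appears in the column headings.

```python
import math

# Excerpt of the t-table reproduced above: degrees of freedom -> {significance: t}.
T_TABLE = {
    13: {0.5: 0.694, 0.3: 1.079, 0.1: 1.771, 0.05: 2.160, 0.01: 3.012},
    14: {0.5: 0.692, 0.3: 1.076, 0.1: 1.761, 0.05: 2.145, 0.01: 2.977},
    15: {0.5: 0.691, 0.3: 1.074, 0.1: 1.753, 0.05: 2.131, 0.01: 2.947},
    16: {0.5: 0.690, 0.3: 1.071, 0.1: 1.746, 0.05: 2.120, 0.01: 2.921},
    17: {0.5: 0.689, 0.3: 1.069, 0.1: 1.740, 0.05: 2.110, 0.01: 2.898},
}


def t_interval(x_bar, s2, n, significance):
    """Limits x_bar +/- t * S_xbar, with t taken from the table excerpt.

    s2 is the sample variance with divisor n, so the estimated variance
    of the mean is s2 / (n - 1).
    """
    t = T_TABLE[n - 1][significance]   # degrees of freedom = n - 1
    standard_error = math.sqrt(s2 / (n - 1))
    return x_bar - t * standard_error, x_bar + t * standard_error


# Hypothetical sample: n = 15, mean 43, S^2 = 25; .70 confidence (.3 significance).
low_70, high_70 = t_interval(43, 25, 15, 0.3)
# Same sample at .95 confidence (.05 significance).
low_95, high_95 = t_interval(43, 25, 15, 0.05)

print(round(low_70, 2), round(high_70, 2))
print(round(low_95, 2), round(high_95, 2))
```

For the .95 interval the table gives t = 2.145 rather than the normal-curve value of about 2, so the small-sample interval is somewhat wider than the z-based interval would be.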

III.  ELEMENTS  OF  SAMPLE  DESIGN

Designing a sample is a matter of getting the most accuracy for your money, and is a problem apart from that of obtaining "valid" results (in the sense of being able to draw correct inferences). Validity derives from adhering to the rather well defined rules of good procedure, such as using correct estimators and maintaining independent selection of sample points. A designed sample, on the other hand, seeks to utilize prior subjective knowledge about the population in order to increase accuracy or decrease costs.

SAMPLE  PRECISION  AND  COST 


Increasing  Precision 


Precision  is  increased  by  decreasing  the  variance  of  the  sampling 
distribution.  There  are  four  fundamental  methods  for  achieving  this 
result: 


(1)  Increasing  sample  size. 

(2)  Stratifying  the  population. 

(3)  Using  auxiliary  variables  in  the  estimator. 

(4)  Using  unequal  probabilities  of  selection. 


The simplest way to increase precision is to increase the sample size. This has already been discussed in connection with choosing the sample size for simple random samples.

Stratification involves dividing the population into two or more subpopulations and sampling from each. Stratification always reduces the sampling variance provided the variability within strata is less than the variability in the overall population. It is also possible to stratify after the sample has been drawn, but this is usually not as efficient.

Sometimes  in  sampling  there  is  the  opportunity  to  observe  an 
auxiliary  variable  which  is  closely  related  to  the  main  variable  of 
interest, and which can be utilized in the estimator to give more precise estimates. Two such estimators (ratio and regression) are discussed in Section IV.
Unequal probability sampling offers another way of making good use of an auxiliary variable. As well as being used in the estimator, the auxiliary variable is used to determine the probability with which various sample points fall into the sample. Probabilities are set proportionally to the auxiliary variable, and the closer the correlation between the two variables, the more precise the final estimate.

These  methods  will  be  discussed  more  fully  in  the  process  of  de¬ 
scribing  the  several  sample  designs  which  follow  in  this  section  and 
in  Section  IV. 

Cutting  Survey  Costs 

Costs  of  running  a  survey  fall  naturally  into  four  categories: 

(1)  costs  of  observation, 

(2)  travel  costs, 

(3)  coding  costs,  and 

(4)  overhead  costs. 

Observation  costs  are  those  incurred  "on-location"  in  recording 
the  behavior  under  study.  These  costs  vary  directly  with  the  number 
of  sampling  points,  and  therefore  are  reduced  by  decreasing  the  sample 
size.  Any  of  the  sample  designs  that  offer  increased  precision,  for 
a  given  sample  size,  can  likewise  be  used  to  provide  the  same  preci¬ 
sion  at  less  observation  cost. 

Travel  costs  are  those  incurred  in  moving  between  sample  points 
and  home  base.  These  are  mostly  irrelevant  when  sampling  from  a  cen¬ 
tralized  reporting  system.  The  common  method  to  reduce  travel  is  to 
group  the  sample  points  into  clusters  so  that  the  sampler  can  pick  up 
several observations at each location rather than just one. The cluster technique is less efficient (less precise), but sometimes the reduction in travel cost may allow the sampler to recoup his precision loss by selecting more sample points. Cluster sampling will be explained later in detail.

Coding includes those administrative tasks relating to the transformation of sample information recorded by field workers into a form that is amenable to analysis. This may simply require the consolidation of data from several worksheets, or it may involve numerical interpretation of responses recorded on sample questionnaires. The magnitude of coding costs depends on the mode of data collection; but for a given mode, they vary directly with sample size. Careful planning of sample observation procedures may lead to significant savings in data handling costs.

Overhead includes such items as frame construction, sample selection, and calculating estimates. These costs are rather insensitive to
different  sample  designs,  since  planning  and  design  probably  consti¬ 
tute  the  bulk  of  these  costs.  However,  if  the  population  is  very 
large,  the  choice  of  sampling  design  can  significantly  affect  the  time 
necessary  to  construct  a  frame  and  select  sample  points. 

Cost-Precision  Tradeoff 

It has been stated that the choice of sample design depends on both cost and precision of the alternative sampling schemes, but there has been no discussion of combining the two into a single measure. The usual measure is Net Relative Efficiency (NRE). The concept of NRE will be developed by means of a simple example.

Suppose  two  alternative  sampling  schemes,  A  and  B,  are  available. 
For  a  sample  of  size  50,  it  is  estimated  that  sampling  variance  for 
scheme  A  will  be  35,  and  that  for  scheme  B  will  be  about  42.  The 
Relative Efficiency (RE) of A to B is the inverse ratio of the variances:

$$RE(A/B) = \frac{\text{var}(B)}{\text{var}(A)} = \frac{42}{35} = 1.20$$

Scheme A is said to be 20 percent more efficient than scheme B. A 10 percent sample using scheme A would provide the same precision as a 12 percent sample using scheme B.

What  about  costs?  The  costs  of  the  two  schemes  are  estimated, 
variable  costs  (those  proportional  to  sample  size)  are  separated  out, 
and  the  variable  cost  per  sample  point  for  each  is  computed.  This 
variable  component  for  A  is  $50,  and  for  B  is  $40  (these  costs  might 
as easily have been stated in terms of man-hours). The Relative Variable Cost (RVC) of A to B is the ratio of these costs:

$$RVC(A/B) = \frac{VC(A)}{VC(B)} = \frac{50}{40} = 1.25$$

Scheme A is 25 percent more costly than scheme B on a sample point basis.*

Since  sampling  variance  is  inversely  proportional  to  sample  size, 
and  the  RVC  is  on  a  sample  point  basis,  a  consistent  way  to  combine 
the  two  criteria  is  to  divide  the  Relative  Efficiency  by  the  Relative 
Variable  Cost.  The  new  measure  is  called  Net  Relative  Efficiency: 


$$NRE(A/B) = \frac{RE}{RVC} = \frac{1.20}{1.25} = .96 = \frac{\text{var}(B) \cdot VC(B)}{\text{var}(A) \cdot VC(A)}$$

When costs are considered, scheme A is 4 percent less efficient than B; equivalently, scheme B is 4 percent more efficient than A ($\frac{1}{.96} \approx 1.04$).

For  a  given  level  of  precision,  scheme  B  will  be  4  percent 
cheaper;  for  a  given  budget,  scheme  B  will  provide  4  percent  greater 
precision. 
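The three ratios in this example can be computed in a few lines. The function names are illustrative; the variances and variable costs are the figures given above, and fixed costs are assumed equal or insignificant.

```python
def relative_efficiency(var_a, var_b):
    """RE(A/B): inverse ratio of the sampling variances."""
    return var_b / var_a


def relative_variable_cost(vc_a, vc_b):
    """RVC(A/B): ratio of variable costs per sample point."""
    return vc_a / vc_b


def net_relative_efficiency(var_a, var_b, vc_a, vc_b):
    """NRE(A/B) = RE / RVC."""
    return relative_efficiency(var_a, var_b) / relative_variable_cost(vc_a, vc_b)


# Figures from the example: variances 35 and 42, variable costs $50 and $40.
re_ab = relative_efficiency(35, 42)
rvc_ab = relative_variable_cost(50, 40)
nre_ab = net_relative_efficiency(35, 42, 50, 40)

print(round(re_ab, 2), round(rvc_ab, 2), round(nre_ab, 2))
print("preferred scheme:", "A" if nre_ab > 1 else "B")
```

An NRE(A/B) below one points to scheme B; had scheme A cost $45 per sample point instead of $50, the same calculation would favor A.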

The  choice  of  scheme  B  has  depended  on  some  necessarily  rough 
"guesstimates."  The  feeling  is,  however,  that  these  calculations 
lead  to  a  best  guess  when  performed  by  someone  with  good  subjective 
familiarity  with  the  behavior  under  study.  The  need  for  this  kind  of 
preliminary  analysis  illustrates  the  usefulness  of  prior  sample  sur¬ 
veys  that  are  well  documented.  The  concept  of  Net  Relative  Efficiency 
receives  detailed  treatment  in  Jessen,  pages  97-103. 

Again,  sampling  designs  are  motivated  by  the  desire  to  re-estab¬ 
lish  the  cost-precision  tradeoff  at  a  more  favorable  level  than  is 
obtained  by  simple  random  sampling.  The  most  basic  designs  are  de¬ 
scribed  in  the  remainder  of  this  section. 


*Assume equal or insignificant fixed cost.

BASIC TECHNIQUES OF SAMPLE DESIGN

The  most  basic  of  the  sampling  techniques  are  those  that  "parti¬ 
tion"  the  population  so  that  the  resulting  sample  will  reflect  some 
special knowledge of the manner in which the population units naturally
occur.  Four  techniques  are  considered: 

.  Stratified  sampling 
.  Cluster  sampling 
.  Subsampling
.  Systematic  sampling 

These  are  the  foundations  of  the  more  complex  schemes  often  required 
in  real  world  applications  of  sample  surveys. 

Description  of  the  basic  sample  designs  will  include  the  motiva¬ 
tion  for  their  use:  why  the  design  is  used;  advantages  and  disadvan¬ 
tages;  relative  costs  of  application;  and  allocation  of  sample  units. 

A  simple  illustration  of  each  design  is  also  included. 

The  formulas  for  estimation  of  population  means  and  variances 
are not found in the descriptions but are given in Appendix II. This
has  the  dual  purpose  of  (1)  smoothing  the  way  for  those  who  are  more 
interested  in  the  rationale  behind  different  designs  than  the  arith¬ 
metic  of  estimation,  and  (2)  gathering  the  various  formulas  into  a 
few  pages  for  easy  comparison.  Most  of  the  examples  include  measures 
of  estimation  and  may  prompt  the  interested  reader  to  refer  to  the 
appendix;  although  the  general  tone  of  the  applications  and  the  rea¬ 
soning  behind  the  choices  of  designs  should  be  apparent  without  having 
to  become  immersed  in  actual  numbers,  these  calculations  were  included 
for  those  desiring  to  see  the  formulas  in  action. 

Please  note  the  following  convention  for  stratification,  cluster¬ 
ing,  and  subsampling.  The  population  is  divided  into  N  partitions, 
of  which  n  partitions  are  designated  for  sampling;  each  partition  con¬ 
sists  of  M  data  points,  of  which  m  are  selected  for  the  sample.  Thus 
the total number of data in the population is equal to MN, and the
total sample size is mn.


Stratified Sampling

In stratified random sampling, the population is divided into
non-overlapping subpopulations, called strata. A simple random sample
is then drawn in each stratum.

[Figure: a population partitioned into Stratum #1, Stratum #2,
Stratum #3, and Stratum #4]


There are four principal reasons for stratifying.

First,  it  sometimes  is  desired  to  obtain  estimates  for  subdivi¬ 
sions  of  the  population. 

Second,  it  may  be  administratively  convenient  to  break  up  the 
population  into  strata  of  a  size  easier  to  work  with. 

Third,  sampling  problems  may  differ  in  different  parts  of  the 
population.  For  example,  in  sampling  long-haul  communications  person¬ 
nel  stationed  on  air  bases,  it  would  be  practical  to  put  SAC  and  ADC 
in a separate stratum since they administer their own communications.
The data sources for these two commands, and their sampling frame,
would be of a different nature than those of the other major commands,
which are served by the Air Force Communications Service.

Fourth,  considerable  precision  may  be  gained  if  it  is  possible 
to divide a heterogeneous population into strata that are internally
homogeneous. Differences between strata do not contribute to the
stratified sampling variance. Thus, the less variability within
strata, the smaller the sampling variance.

The  simplest  way  to  allocate  the  sample  is  to  use  proportional 
allocation, that is, to make the number of sample units drawn from
each stratum proportional to the total number of units in that stratum.
The gain in precision over simple random sampling is, in this allocation,
entirely due to the prior knowledge of the population that led to its
partitioning (i.e., the knowledge that all of the variances within
strata are smaller than the overall population variance).

Proportional allocation overlooks two items of information that
may be at the disposal of the analyst: (1) differences in variance
\(\sigma_i^2\) from stratum to stratum, and (2) differences in the cost
\(c_i\) associated with observing a unit in each stratum. Since the dual
purpose is to minimize both overall sampling variance and cost, it
follows that more units should be drawn from high variance strata where
sampling is inexpensive. When the stratum sample sizes \(m_i\) are set
proportional to the respective standard deviations \(\sigma_i\) and
stratum sizes \(M_i\), and inversely proportional to the square root of
the costs \(c_i\), allocation is said to be optimal. The fact that some
uncertainty may be attached to the knowledge of \(\sigma_i\) and \(c_i\)
does not impair the lack of bias of the final estimate of \(\bar{X}\).
If the analyst is confident in his estimate of at least the relative
magnitudes of the \(c_i\) and of the \(\sigma_i\), it is better to use
optimal allocation rather than proportional allocation.

Example. Suppose that an estimate is desired for the average
dollar-cost of replenishment spares for a tactical fighter with the
following deployment (by command):

    Command        No. UE
    TAC              450
    TAC-CCTW*        150
    PACAF             75
    USAFE             75
                     ---
    Total            750

For each aircraft there is a record of all major modifications, spares
consumed, and major maintenance. Each aircraft can be identified by
its tail-number. Assume that the desired tolerance for estimated
average cost is ± $300 at the 90 percent confidence level.

*Combat Crew Training Wing.

The first task is to get a rough estimate of the standard deviation
of spares costs for the 750 aircraft. Suppose an informed individual
suggests that the distribution of costs is fairly bell-shaped, but
skewed to the right; furthermore, he feels that about 95 percent of the
aircraft have spares costs equal to $11,250 ± $2800. Noting that
±2σ usually encompasses 95 percent, the standard deviation is estimated
to be ½(2800), or $1400.

If  simple  random  sampling  were  applied  to  this  problem,  the  sample 
size  would  be  determined  as  follows: 

n = 59.4 ≈ 60

One could expect to do better by stratifying according to the four
command-categories (TAC, PACAF, etc.) above, since program
characteristics (flying-hour programs, etc.) are likely to affect spares
consumption. Using the same sample size as above, and adopting
proportional allocation, the \(m_i\) for the various strata are:

    Stratum        m_i
    TAC            60(450/750) = 36
    TAC-CCTW       60(150/750) = 12
    PACAF          60( 75/750) =  6
    USAFE          60( 75/750) =  6
                                 --
                                 60

The design may be improved by speculating as to the relative
differences in dispersion and sampling costs among the strata and by
using optimal allocation:

\[ m_i \ \text{proportional to} \ \frac{M_i \sigma_i}{\sqrt{c_i}} \]

Suppose one could expect sampling costs overseas to be double those in
the Z.I. Furthermore, one might expect the dispersion in TAC-CCTW to
be one-half the dispersion within TAC, with the other strata somewhere
in between. Accordingly, the following table lists relative costs,
standard deviations, and the allocation that results:


                Relative   Relative
    Stratum     Cost       Disp.      M_i σ_i/√c_i         m_i
    TAC            1          4       450(4/√1) = 1800     60(1800/2420) = 44.6 ≈ 45
    TAC-CCTW       1          2       150(2/√1) =  300     60( 300/2420) =  7.4 ≈  7
    PACAF          2          3        75(3/√2) =  160     60( 160/2420) =  4.0 ≈  4
    USAFE          2          3        75(3/√2) =  160     60( 160/2420) =  4.0 ≈  4
                                                  ----                            --
    Total                                         2420                            60

Notice  that  in  the  new  allocation,  high-cost  strata  are  sampled  less 
and  the  high-variance  stratum  is  sampled  more. 
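
As a rough check on both allocations, the calculation can be sketched in Python. The function names are illustrative only, and simple rounding is an assumption of this sketch; in general, rounded allocations need not add up exactly to the total sample size, although they do for these figures.

```python
import math


def proportional_allocation(total_n, sizes):
    """Allocate the sample in proportion to stratum size."""
    total = sum(sizes)
    return [round(total_n * size / total) for size in sizes]


def optimal_allocation(total_n, sizes, std_devs, costs):
    """Allocate in proportion to M_i * sigma_i / sqrt(c_i)."""
    weights = [m * s / math.sqrt(c) for m, s, c in zip(sizes, std_devs, costs)]
    total = sum(weights)
    return [round(total_n * w / total) for w in weights]


strata = ["TAC", "TAC-CCTW", "PACAF", "USAFE"]
sizes = [450, 150, 75, 75]   # aircraft per command
rel_disp = [4, 2, 3, 3]      # relative standard deviations from the table
rel_cost = [1, 1, 2, 2]      # relative sampling costs from the table

prop = proportional_allocation(60, sizes)
opt = optimal_allocation(60, sizes, rel_disp, rel_cost)

for name, p, o in zip(strata, prop, opt):
    print(f"{name:10s} proportional={p:3d} optimal={o:3d}")
```

Only the relative magnitudes of the costs and standard deviations matter here, since any common scale factor cancels in the ratio.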

Cluster  Sampling 

In  cluster  sampling,  the  population  is  divided  into  groups,  or 
clusters, of units. Several of the clusters are chosen at random,
and all units in each selected cluster become part of the sample. The
clusters are referred to as primaries, whereas the units contained
therein are secondaries.

There are two major reasons that lead to the choice of cluster
sampling.

First, there is sometimes no list of the population available on
which to base a sampling frame and it is felt that such a list would
be too expensive to construct, whereas it is relatively easy to come
by a list of clusters of units. Suppose it is desired to sample
message lengths in a Communications Sector. The practical procedure
would be to sample clusters of messages, i.e., messages received at
selected installations during some specified time interval.

Second,  cluster  sampling  may  be  desirable  if  the  population  is 
such  that  travel  costs  can  be  reduced  by  selecting  adjacent  units. 

For example, if failure rates for some item of base equipment are
being sampled, it may be cheaper to select a number of bases and
observe all units on those bases than to take a simple random sample.

The  relative  cost  for  specified  precision  (and  equivalently  the 
relative variance for specified cost) is (1) proportional to the
relative cost of observing one cluster, (2) proportional to the variation
between clusters, and (3) inversely proportional to the relative size
of the cluster. If in estimating \(\bar{X}\) a choice is to be made
between several different cluster sizes, it can be shown that the
criterion is to choose that cluster size that minimizes the product of
sampling variance times total cost (both of which vary, depending on
cluster size).

When  cluster  sampling  is  chosen  as  a  matter  of  convenience,  the 
final estimate will generally be less precise than a simple random
sample of the same size. Therefore the decision rests on whether the
cost reduction allows the selection of a large enough sample to
actually increase precision. This situation contrasts with the stratified
sample, where an estimate less precise than that from simple random
sampling is very unlikely, and would almost require contrived strata
designed specifically for that result. Of course, if the clustering
were designed so that variation within clusters was greater than that
between clusters, then the estimate would be more precise than the
simple random case. Such an arrangement is not likely; it is typically
easier to partition the population into groups of homogeneous units
(as in stratification) than heterogeneous units.

Example. A frequent problem for the cost analyst is to estimate
the cost of consumption items that are common to more than one system.
Frequently, these items are centrally managed, and their consumption
reported only in aggregate. Let us postulate, for example, a study
of administrative/support aircraft, in which it is desired to know the
annual cost of low value replenishment spares consumed. A large group
of spares are common to two aircraft (aircraft #1 and aircraft #2)
assigned to 100 world-wide locations. Consumption accounting is by
commodity only, necessitating some external data collection for study
purposes. One solution would be to request that maintenance managers
at each of the 100 locations keep detailed records of the final
application of the common spares in question. This would be time consuming
and costly and would probably provide more detail than necessary. What
follows is a cluster sampling design that would probably provide very
adequate information at significantly less cost.

Suppose that aircraft #1 is stationed on all 100 bases, but
aircraft #2 is only on 40 bases. Designating a one-year time period and
defining a cluster to be a one-month period (12 clusters per base),
the population contains 1200 clusters, 50 of which will be sampled.
Since all common spares in question sent to 60 bases are consumed by
aircraft #1, attention may be restricted to the remaining 40 bases:

[Figure: the 1200 base-month clusters, with the sample of 50
base-months drawn from the 40 bases that have both aircraft]

The procedure will be to estimate the proportion of common spares
consumed by aircraft #2 in the smaller stratum, then make an adjustment
to allow for the other 60 bases.

Fifty clusters are randomly chosen from the smaller stratum. The
maintenance chief at each selected base is instructed to keep records
regarding the disposition of all common spares during the particular
month(s) chosen. The information to be reported is the month's total
consumption of common spares \(M_i\) and the consumption recorded for
aircraft #2 \(X_i\). When all the information is in, the estimated
proportion of common spares going to aircraft #2 for the 40 bases is:

\[ P_{40} = \frac{\sum_{i=1}^{50} X_i}{\sum_{i=1}^{50} M_i} \]

where \(\sum^{50} M_i\) is the total sample consumption and
\(\sum^{50} X_i\) is consumption by aircraft #2. The sampling variance
for this estimator is estimated by the formula for unequal cluster
sizes, substituting \(P_{40}\) for \(\bar{X}\):

\[ S^2_{P_{40}} = \frac{1-f}{50(\bar{M})^2 49} \left[ \sum^{50} X_i^2 + P_{40}^2 \sum^{50} M_i^2 - 2 P_{40} \sum^{50} M_i X_i \right] \]

where \(f = 50/480\) is the fraction of clusters sampled and
\(\bar{M}\) is the mean of the sampled \(M_i\).

The proportion of common spares consumed by all bases for aircraft #2
is then estimated by weighting \(P_{40}\) to allow for the difference
between the 40 bases and the entire 100 bases:

\[ P_2 = P_{40} \, \frac{\sum^{480} M_i}{\sum^{1200} M_i} \]

where \(\sum^{480} M_i\) is the year's consumption at the 40 bases
(representing 480 clusters), and \(\sum^{1200} M_i\) is the consumption
for all 100 bases (equivalent to 1200 clusters). The proportion consumed
by aircraft #1 is estimated by:

\[ P_1 = 1 - P_2 \]

The sampling variance is the same for both \(P_1\) and \(P_2\) and is
estimated by weighting \(S^2_{P_{40}}\):

\[ S^2_P = \left( \frac{\sum^{480} M_i}{\sum^{1200} M_i} \right)^2 S^2_{P_{40}} \]

Subsampling 

Subsampling,  or  two-stage  sampling,  is  a  hybrid  of  cluster  and 
stratified  sampling.  The  population  is  partitioned  into  N  primaries, 
and  n  of  these  primaries  are  randomly  selected.  A  subsample  of  m 
secondaries  is  then  randomly  selected  from  each  primary.  This  tech¬ 
nique  is  sometimes  extended  to  three  or  four  stages.  The  discussion 
that follows will consider the case where each primary contains
the same number (M) of secondaries, and the same number of secondaries
(m) are sampled from each primary.

[Figure: two-stage sample with n = 4, M = 6, m = 2]


The  main  advantage  of  subsampling  over  one-stage  sampling  is 
flexibility. It reduces to cluster sampling when m = M, or to
stratified sampling when n = N; but in terms of the cost-precision
tradeoff, a scheme that falls somewhere between these two may be preferable.

The problem is to determine values of n and m such as to minimize
sampling variance for a given cost (or equivalently, to minimize cost
for a specified variance). Appendix II provides a method for solving
this problem that requires preliminary estimates of (1) the cost of
sampling associated with each cluster \(c_1\), (2) the cost of sampling
secondaries within clusters \(c_2\), (3) the variance between cluster
means \(S_B^2\), and (4) the variance of secondaries within clusters
\(S_w^2\). For this purpose, these estimates do not require great
precision because the sampling variance is not highly sensitive to the
choice of m. It is usually easier to estimate ratios \(c_1/c_2\) and
\(S_w^2/S_B^2\), in which case tables are available to aid the
evaluation of m (see Cochran, page 282).

Example.  Since  the  greater  portion  of  USAF  base-level  reporting 
systems  have  been  designed  primarily  for  management  and  control  pur¬ 
poses,  the  needs  of  the  planning  and  programming  oriented  cost  analyst 
are  not  always  satisfied;  it  has  generally  been  more  expedient  to  put 
accountability  on  an  organizational  basis  rather  than  a  program  basis. 
Certain  base-support  organizations  provide  service  to  a  plurality  of 
programs,  and  in  order  to  allocate  activity  on  a  program  basis,  the 
cost  analyst  must  often  adopt  some  arbitrary  pro-ration  scheme. 

The following example suggests how a subsampling design might be
used to estimate the average daily man-hours devoted by Civil
Engineering squadrons to repair and maintenance of aircraft alert
facilities during a 90-day period. It is assumed that a daily record of
work-orders is maintained in a general ledger, and that inspection of
the ledger will provide the data needed.

Assume  that  there  are  126  C-E  squadrons  overseas  and  in  the  Z.I.; 
each  of  these  will  be  regarded  as  a  primary  cluster.  Each  cluster 
consists  of  90  days  of  information.  The  procedure  will  be  to  select 
n  squadrons  at  random,  then  select  m  days  within  each  cluster.  The 
total  man-hours  devoted  to  aircraft  alert  facilities  maintenance  during 
the  selected  squadron-days  will  be  found  by  examining  the  appropriate 
ledger. 

The first problem is to decide the optimum value of m. This
requires "guesstimates" of \(S_w^2\), \(S_B^2\), \(c_1\), and \(c_2\).
Since there are about two to three hundred entries per day in each
squadron's ledger, an allowance of four hours per squadron-day seems
reasonable. The cost of visiting each squadron would be in the
neighborhood of one and a quarter "working" days, or 10 hours.
\(S_w^2\) and \(S_B^2\) are considerably more elusive, but suppose that
examination of ledgers from two or three representative squadrons
suggests 230 and 40, respectively; the optimum subsample size (m) is
then determined as:

\[ m = 6 \]


The  number  of  clusters  selected  (n)  can  be  determined  in  one  of  two 
ways,  depending  on  whether  total  cost  or  overall  precision  is  held 
constant. Suppose the total time allocated to the collection of data
is set at 40 workdays (320 hours):

\[ C = n c_1 + n m c_2 \]

\[ 320 = n(10) + n(6)(4) \]

\[ n = 9.4 \approx 9 \]
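
The fixed-budget case can be sketched directly from the cost equation. The function name is illustrative; rounding down is an assumption made here so that the budget is not exceeded.

```python
import math


def clusters_for_budget(total_cost, c1, c2, m):
    """Largest n satisfying n*c1 + n*m*c2 <= total_cost."""
    return math.floor(total_cost / (c1 + m * c2))


# Figures from the text: 320 hours, 10 hours per squadron visit,
# 4 hours per squadron-day examined, and m = 6 days per squadron.
n = clusters_for_budget(total_cost=320, c1=10, c2=4, m=6)
hours_used = n * 10 + n * 6 * 4

print(n, hours_used)
```

With these figures the design uses 306 of the 320 hours; the remainder is too small to cover a tenth squadron.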


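If, on the other hand, one can tolerate a sample mean variance of about
5 man-hours, the following formula is solved for n:

\[ S^2_{\bar{X}} = \frac{S_B^2}{n}(1 - f_1) + \frac{S_w^2}{nm}(1 - f_2) f_1 \]

where \(f_1 = n/N\) and \(f_2 = m/M\), so that

\[ 5 = \frac{S_B^2}{n}\left(1 - \frac{n}{126}\right) + \frac{230}{n(6)}\left(1 - \frac{6}{90}\right)\frac{n}{126} \]

\[ n = 7.1 \approx 7 \]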


Systematic  Sampling 

Systematic sampling is not so much a sampling "technique" as it
is a refinement in the use of random numbers. It is discussed here
because it often produces the same effects as stratification or
clustering, and because it is almost an indispensable device when
sampling from very large frames.

The procedure begins with the decision to sample some fraction of
the population, say 1/12. The population is listed and a random
number is selected between 1 and 12, say 8. For the sample, the eighth
unit, and every twelfth unit thereafter, are selected (i.e., #8, #20,
#32, #44, etc.).
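
The selection procedure just described can be sketched as below. The function name and the optional `start` argument are illustrative; `start` is included only so that the worked case from the text (a random start of 8 with an interval of 12) can be reproduced.

```python
import random


def systematic_sample(units, interval, start=None):
    """Select every `interval`-th unit beginning at a 1-based random start."""
    if start is None:
        start = random.randint(1, interval)  # inclusive on both ends
    return units[start - 1::interval]


frame = list(range(1, 61))  # units numbered #1 through #60

sample = systematic_sample(frame, interval=12, start=8)
print(sample)

random_sample = systematic_sample(frame, interval=12)
print(random_sample)
```

Only one random number is drawn; every later selection is determined by the interval.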


Systematic  sampling  has  two  advantages  over  simple  random  sampl¬ 
ing.  First,  it  is  easier  to  draw  the  sample,  since  only  one  random 
number  is  required.  Second,  it  distributes  the  sample  more  evenly 
over  the  population  and  therefore  often  provides  more  accurate  results. 

There are also two potential disadvantages. First, if the population
contains  some  periodic  variation,  and  the  sampling  interval  coincides 
with  that  variation,  the  sample  obtained  may  be  badly  biased.  Second, 
evaluation  of  sampling  variance  is  contingent  on  knowing  the  behavior 
of  the  population  with  respect  to  the  listing. 
IV. REFINEMENTS IN THE ESTIMATOR

The  techniques  previously  discussed  each  dealt  with  some  way  of 
partitioning  the  population  preliminary  to  drawing  the  sample.  The 
estimator of the population mean, \(\bar{X}\), was the same in all
cases: the simple or weighted average of the sampled \(X_i\).

There  are  many  sampling  situations  where  there  exists  some  "auxil¬ 
iary"  variable  which  is  known  to  correlate  with  the  variable  of  in¬ 
terest.  In  such  cases,  sampling  variance  can  be  reduced  by  instituting 
a  basic  change  in  the  estimator  so  as  to  take  advantage  of  the  informa¬ 
tion  contained  in  the  auxiliary  variable.  This  is  the  case  with  ratio 
and  regression  estimation,  which  are  explained  in  this  section.  A 
third  technique,  unequal  probability  sampling,  uses  the  auxiliary 
variable  in  determining  selection  probabilities  as  well  as  in  the 
estimator. 

The  format  for  this  section  is  similar  to  that  of  Section  III, 
although  the  more  complex  designs  inherently  require  more  formula¬ 
tions  in  their  descriptions.  A  summary  of  the  fundamental  characteris¬ 
tics  of  all  the  sampling  techniques  described  in  this  document  concludes 
this  section. 

RATIO  ESTIMATOR 

In ratio estimation, two variables are observed on each sample
unit: \(X_i\), the variate of interest, and \(W_i\), an auxiliary
variable. The auxiliary variable is such that its population mean,
\(\mu_w\), is known. The ratio estimate of the population mean of the
\(X_i\) is given by:

\[ \bar{X}_R = \frac{\bar{X}}{\bar{W}} \, \mu_w \]

where \(\bar{X}\) and \(\bar{W}\) are the sample means.


The ratio estimator is biased, except in the situation where a
regression of X on W would be a straight line through the origin (i.e.,
the ratio \(X_i/W_i\) is approximately constant). The bias is negligible
in large samples.

The sampling distribution is hard to pin down, since both \(\bar{X}\)
and \(\bar{W}\) vary from sample to sample. However, for large samples,
the distribution tends to normal and the bias in the approximate
variance formula becomes negligible.

In  spite  of  these  difficulties,  ratio  estimation  can  be  a  very 
useful  way  to  use  extraneous  information  that  is  not  directly  of 
interest  to  the  analyst.  If  this  extra  information  is  easily  picked 
up  with  the  regular  sample,  the  gain  in  precision  is  cheap,  since  only 
the  final  computations  are  affected. 

Knowledge of the exact relationship between X and W is not
required, but in order for the precision of the ratio estimate to be
greater than a simple sample mean, it is necessary that the following
condition holds:

\[ \rho_{xw} > \frac{CV_w}{2\,CV_x}, \qquad CV_w = \sigma_w/\mu_w, \qquad CV_x = \sigma_x/\mu_x \]

where \(\rho_{xw}\) is the correlation coefficient between X and W, and
\(CV_x\) and \(CV_w\) are the coefficients of variation for X and W,
respectively.

The variability of the auxiliary variate, W, is thus an important
factor; if its coefficient of variation is more than twice that of X,
the ratio estimate is always less precise, since \(\rho_{xw}\) cannot
exceed 1. The preceding result is based on the approximate variance
formula and therefore is applicable to large samples; for small samples,
the condition would be more stringent, since the approximate formula is
usually an underestimate.
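
The large-sample condition, that the correlation between X and W must exceed CV_w/(2 CV_x), can be expressed as a one-line check. The function name and the trial values are illustrative only.

```python
def ratio_beats_simple_mean(rho_xw, cv_w, cv_x):
    """Large-sample condition: rho_xw must exceed CV_w / (2 * CV_x)."""
    return rho_xw > cv_w / (2.0 * cv_x)


# Equal coefficients of variation: the threshold is one-half.
print(ratio_beats_simple_mean(0.6, cv_w=0.3, cv_x=0.3))

# CV_w more than twice CV_x: no correlation can satisfy the condition.
print(ratio_beats_simple_mean(1.0, cv_w=0.7, cv_x=0.3))
```

Because the check rests on the approximate variance formula, a margin above the threshold is prudent for small samples.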


Example 

A  common  use  of  the  ratio  estimator  occurs  when  there  has  been 
a  complete  census  of  the  particular  variable  of  interest  in  some  pre¬ 
vious  time  period.  Suppose  it  is  desired  to  estimate  the  current 
average  inventory  of  fuel  at  USAF  air  bases,  and  that  for  purposes  of 
the example these data are available on a base-by-base basis only as of
the end of the previous year. Let \(X_i\) be the current inventory and
\(W_i\) the previous inventory at the i-th base in the sample. The
population average as of the end of the previous year will be indicated
by \(\mu_w\).

Before  applying  the  ratio  estimator,  it  will  be  prudent  to  de¬ 
termine  its  usefulness  compared  with  a  simple  sample  mean.  It  is 
reasonable to assume that the ratio estimator will be unbiased (i.e.,
\(X_i/W_i\) is constant) since a force-wide adjustment in fuel
inventories would probably derive from some implicit general policy
change that has proportional effects on all bases. A quick check of this
assumption can be made by plotting X against W, noting whether a
freehand regression line passes through the origin. For example:

[Figure: plot of X against W with a freehand regression line]

(If the regression line does not pass through the origin, and the sample
is not large, it would be preferable to consider the regression
estimator as described in later pages.) Attention is next directed to
whether the ratio estimator is more precise than the simple mean, using
the criterion \(\rho_{xw} > \tfrac{1}{2}(CV_w)/(CV_x)\). In this case,
\(CV_w\) and \(CV_x\) are probably the same, since W and X are
essentially the same variable. So the question reduces to whether
\(\rho_{xw}\) is greater than one-half, which does not seem unreasonable
unless base fuel inventories fluctuate widely over time. A quick check
is provided by observing whether the freehand regression line seems to
"explain" more than one-half the variation in X.

If  the  foregoing  analysis  establishes  the  ratio  estimator  as  ap¬ 
propriate,  estimates  of  the  mean  and  variance  proceed  according  to 
the formulas given in Appendix II. Supposing there are 150 air bases
in the population from which 20 are sampled, the calculations might
proceed according to the following worksheet (inventories are expressed
in thousands of barrels):
[Worksheet: entries for the individual bases are not legible; the
column totals are as follows]

                Population W_i   Sample W_i   Sample X_i   (X_i - rW_i)²
    Totals           8250           940          1080          592.3

\[ \mu_w = 8250/150 = 55 \]

\[ \bar{W} = 940/20 = 47 \]

\[ \bar{X} = 1080/20 = 54 \]

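The ratio estimate of average fuel inventory is:

\[ \bar{X}_R = \frac{\bar{X}}{\bar{W}} \, \mu_w = \frac{54}{47}(55) = 63.2 \]

The variance of the ratio estimate is estimated by:

\[ s^2_{\bar{X}_R} = \frac{1-f}{n(n-1)} \sum^{20} (X_i - rW_i)^2 = \frac{1 - 20/150}{20(19)}(592.3) = 1.4 \]

where \(r = \bar{X}/\bar{W}\).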
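
The worksheet arithmetic can be reproduced from the column totals alone. In this sketch the function names are illustrative, and 592.3 is taken to be the sum of squared deviations of X_i from r W_i, which is an interpretation of the worksheet's last column.

```python
def ratio_estimate(x_bar, w_bar, mu_w):
    """Ratio estimate of the population mean of X."""
    return x_bar / w_bar * mu_w


def ratio_variance(resid_ss, n, pop_n):
    """Estimated variance: (1 - f) * sum((X_i - r*W_i)^2) / (n * (n - 1))."""
    f = n / pop_n
    return (1.0 - f) * resid_ss / (n * (n - 1))


mu_w = 8250 / 150    # known population mean of last year's inventory
w_bar = 940 / 20     # sample mean of last year's inventory
x_bar = 1080 / 20    # sample mean of current inventory

estimate = ratio_estimate(x_bar, w_bar, mu_w)
variance = ratio_variance(592.3, n=20, pop_n=150)

print(round(estimate, 1), round(variance, 2))
```

The estimate exceeds the simple sample mean of 54 because the sampled bases held less fuel last year, on average, than the population as a whole (47 against 55).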


This example has not included any discussion of how the data are to be
collected. The simplest case would be a simple random selection of
bases, but there is no reason why stratified or cluster sampling should
not be used, if the characteristics of the population warrant it. In
the present example, it would probably be useful to stratify by major
command, since base fuel consumption should be significantly more
homogeneous within commands than between commands. Remembering that
the stratified estimator of the population mean is just a weighted
average of stratum means, the stratified-ratio estimator can be written
as:


where \(\bar{X}_i\) and \(\bar{W}_i\) are simple means of bases sampled
from the i-th command. The estimator of variance will be:


Thus, even a simple marriage of two sampling techniques complicates
estimation of variance. This problem is discussed in a general way
under the heading Complex Designs, beginning on page 57.

REGRESSION  ESTIMATOR 

The regression estimator is more appropriate than the ratio
estimator if the relation between X and W is linear but does not go
through the origin. In this case, the estimate of the population mean is:

\[ \bar{X}_r = \bar{X} + b(\mu_w - \bar{W}) \]

where b is an estimate of the change in X when W is increased by 1.
The reasoning is that if the sample \(\bar{W}\) is below average, one
could expect the sample \(\bar{X}\) to also be below average by an
amount \(b(\mu_w - \bar{W})\). The value of b is usually estimated from
the sample using the least-squares estimator:

\[ b = \frac{\sum^{n} (X_i - \bar{X})(W_i - \bar{W})}{\sum^{n} (W_i - \bar{W})^2} \]
Contrary to the case in general regression analysis, it is not
necessary to assume exact linearity between X and W, nor that the
variance of X for a given \(W_i\) is constant (again, provided the
sample size is large).

As with ratio estimates, the regression estimate is generally
biased. But for large samples, the ratio of bias to standard error
becomes small, making the bias negligible. Furthermore, there is no
bias if an exact linear relationship exists between X and W. What
constitutes a "large" sample depends on how X and W are correlated,
and cannot be summarized by a rule of thumb.

For  large  samples,  the  regression  estimate  is  more  precise  than 
the  simple  sample  mean  provided  that  there  is  some  correlation  between 
X  and  W;  it  is  more  precise  than  the  ratio  estimate  unless  the  rela¬ 
tion  between  X  and  W  is  a  straight  line  through  the  origin.  Thus, 
there is nothing to lose in using the regression estimator except the
extra  time  spent  in  calculation. 

Example 

An interesting application of the regression estimator is the use of "eyeball" estimates for the auxiliary variables. For example, suppose there is a proposal to replace some training equipment at an air base, but it is first necessary to assess the salvage value of the old equipment. The analyst, or a salvage expert, would quickly survey each item of equipment, roughly estimating its approximate salvage value. Then a random sample would be selected, and the exact salvage value of each sampled item determined by close inspection. The regression estimator is then applied, labeling the individual rough estimates $W_i$, the average of all rough estimates $\mu_w$, and the more thorough estimates $X_i$.

Supposing the population contains 120 items of equipment and a sample of 20 is to be drawn, the following analysis might result:

[Worksheet: rough estimates W_i for the population and, for each sampled item, X_i, W_i, and the deviation columns. Legible entries:]

Sample item    W_i     X_i    X_i - X̄   W_i - W̄   product   (W_i - W̄)²   residual²
First                  176      -33       -35       1155       1225        54.8
Total         4300    4180        0         0      10350      14083       605.8

product = (X_i - X̄)(W_i - W̄); residual = (X_i - X̄) - b(W_i - W̄)

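$$\mu_w = 23{,}760/120 = 198$$

$$\bar{W} = 4300/20 = 215$$

$$\bar{X} = 4180/20 = 209$$

$$b = \frac{\sum (X_i - \bar{X})(W_i - \bar{W})}{\sum (W_i - \bar{W})^2} = \frac{10{,}350}{14{,}083} = .73$$

The regression estimate of average salvage value is given by:

$$\bar{X}_r = \bar{X} + b(\mu_w - \bar{W}) = 209 + .73(198 - 215) = 196.6$$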


The variance of this estimate is estimated by:

$$s_{\bar{X}_r}^2 = \frac{1}{n(n-1)} \sum_{i=1}^{n} \left[(X_i - \bar{X}) - b(W_i - \bar{W})\right]^2 = \frac{1}{20(19)}(605.8) = 1.6$$
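The arithmetic of this example can be retraced from the summary figures alone. The following is a minimal sketch in Python, not part of the original worksheet; the variable names are illustrative, and the variance line assumes the residual sum of squares is divided by n(n - 1) with no finite population correction, which is the form that reproduces the quoted 1.6.

```python
# Summary figures quoted in the worked example.
N, n = 120, 20                 # population and sample sizes
total_w_population = 23_760    # sum of all 120 rough estimates
sum_w, sum_x = 4_300, 4_180    # sample totals of W_i and X_i
s_xw = 10_350                  # sum of (X_i - X_bar)(W_i - W_bar)
s_ww = 14_083                  # sum of (W_i - W_bar)^2
resid_ss = 605.8               # sum of squared residuals about the fitted line

mu_w = total_w_population / N  # average of all rough estimates
w_bar = sum_w / n              # sample mean of rough estimates
x_bar = sum_x / n              # sample mean of exact values
b = s_xw / s_ww                # least-squares slope

# Regression estimate of the population mean.
x_reg = x_bar + b * (mu_w - w_bar)

# Assumed variance form: residual sum of squares over n(n - 1).
var_reg = resid_ss / (n * (n - 1))

print(f"mu_w={mu_w:.0f}  W_bar={w_bar:.0f}  X_bar={x_bar:.0f}  b={b:.4f}")
print(f"regression estimate = {x_reg:.1f}, variance = {var_reg:.2f}")
```

Carrying b at full precision gives an estimate of about 196.5; rounding b to .73 before multiplying, as in the text, gives 196.6.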


Although the rough estimates ($W_i$) are biased, one could expect the bias to be constant from item to item, except for random variation. If this random variation is not too great, the regression estimator will be unbiased for small samples. For this reason, it is important that the same person make all the rough estimates. It is also important that this person not know what items fall into the sample until the rough estimates have been made. Provided the latter condition holds, the consistency of rough estimates can be checked by plotting the sample $X_i$ versus the $W_i$.


PROBABILITY  SAMPLING 


This technique utilizes an auxiliary variable in determining selection probabilities as well as in a special estimator. As previously mentioned, the idea is to find a variable which is closely correlated with the particular variable of interest. Probabilities of selection are set proportional to the former, the sample is collected, and the following estimator is used:

$$\frac{1}{n}\sum_{i=1}^{n}\frac{X_i}{P_i}$$

where $X_i$ is the variable of interest, $W_i$ is the auxiliary variable, and $P_i = W_i / \sum W_i$ is the probability of selecting $X_i$.

Unequal probability sampling is a great aid in increasing precision when an auxiliary variable with the proper characteristics is available. The technique has received much attention in the past ten years or so despite problems in application. For example, in sampling with replacement the calculation of variance is straightforward. When sampling without replacement, however, there are problems of controlling the $P_i$ and estimating variance that are beyond the scope of this paper. Furthermore, the exact form of the sampling distribution is not known. Suffice it to say that gains are to be made when X and W are closely correlated, but that the complete theory of this kind of sampling is still being developed in current research literature (see bibliography). A simplified example will be given to illustrate the power of the technique.

*An alternative way to arrive at rough estimates is for the analyst to develop an estimating relationship on the basis of historical information, using such parameters as original cost, age, and usage rate.

Example  I 

Suppose it is decided to estimate the total personnel stationed on five military installations in some remote region of Northern Canada, using a sample of three. It is expected that the average number of personnel presently stationed at each installation (variable X) is closely correlated with the average number of personnel of the previous year (variable W), the data for which are known. The procedure is to choose three random numbers between 1 and $\sum W_i$, making the selection of sample points on the basis of a cumulative list of variable W. Hence:


Base     W_i    Cumulative W_i    Random Number    X_i
1         22          22               14           25
2         36          58                            37
3         21          79               62           29
4         34         113               97           34
5         11         124                            12
Total    124                                       137

The usual estimate for simple random sampling would be:

$$N\bar{X} = \frac{N}{n}(X_1 + X_3 + X_4) = \frac{5}{3}(25 + 29 + 34) = 147$$

The unequal probability estimate is:

$$\frac{1}{n}\sum_{i=1}^{n}\frac{X_i}{P_i} = \frac{1}{3}\left[25\left(\frac{124}{22}\right) + 29\left(\frac{124}{21}\right) + 34\left(\frac{124}{34}\right)\right] = 145$$

where $P_i = W_i / \sum W_i$ is the probability with which the sample point entered the sample.
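The cumulative-list selection and the two estimates can be sketched in a few lines of Python. This is an illustration only: the helper name `select_base` is hypothetical, and the three random numbers are the ones shown in the table rather than fresh draws.

```python
from bisect import bisect_left
from itertools import accumulate

w = [22, 36, 21, 34, 11]   # previous-year personnel, bases 1-5
x = [25, 37, 29, 34, 12]   # current personnel (observed only for sampled bases in practice)

cumulative = list(accumulate(w))   # running totals of W
total_w = cumulative[-1]

def select_base(random_number):
    """Return the 1-based base whose cumulative range contains the number."""
    return bisect_left(cumulative, random_number) + 1

random_numbers = [14, 62, 97]      # the draws used in the example
sample = [select_base(r) for r in random_numbers]

# Simple expansion estimate: N times the sample mean.
srs_estimate = len(w) * sum(x[i - 1] for i in sample) / len(sample)

# Unequal probability estimate: average of X_i / P_i with P_i = W_i / sum(W).
pps_estimate = sum(x[i - 1] * total_w / w[i - 1] for i in sample) / len(sample)

print(sample, round(srs_estimate, 1), round(pps_estimate, 1))
```

In practice the random numbers would be drawn with something like `random.randint(1, total_w)`; because each draw is independent, the same base can be selected more than once.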

There are a total of ten possible samples of size three that could be drawn from this population. The following table compares the simple random sample estimate with the unequal probability estimate for each case:

Bases sampled          123   124   125   134   135   145   234   235   245   345
Simple random          152   160   123   147   110   118   167   130   138   125
Unequal probability    147   131   135   145   149   133   141   145   129   143

Except for two cases, the unequal probability estimate is closer to the population value of 137. The standard error for the unequal probability estimates is 7.3, whereas that for simple random sampling is 17.8. If W and X were more closely correlated, one would expect even better results.
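The ten-sample comparison can be checked by brute force. The sketch below is an illustration, not the original computation: it enumerates every three-base sample and treats the ten samples as equally likely when computing the root-mean-square error about the true total of 137, which is an assumption of this sketch.

```python
from itertools import combinations
from math import sqrt

w = [22, 36, 21, 34, 11]   # auxiliary variable, bases 1-5
x = [25, 37, 29, 34, 12]   # variable of interest, bases 1-5
N, n = len(w), 3
total_w, total_x = sum(w), sum(x)

def srs_estimate(sample):
    """N times the sample mean."""
    return N * sum(x[i] for i in sample) / n

def pps_estimate(sample):
    """Average of X_i / P_i with P_i = W_i / sum(W)."""
    return sum(x[i] * total_w / w[i] for i in sample) / n

samples = list(combinations(range(N), n))   # all possible samples of three bases

def rms_error(estimator):
    """Root-mean-square deviation from the true total, samples weighted equally."""
    return sqrt(sum((estimator(s) - total_x) ** 2 for s in samples) / len(samples))

for s in samples:
    label = "".join(str(i + 1) for i in s)
    print(label, round(srs_estimate(s)), round(pps_estimate(s)))

print(round(rms_error(srs_estimate), 1), round(rms_error(pps_estimate), 1))
```

Under that equal-weighting assumption the simple random figure comes out at about 17.8 and the unequal probability figure at about 7.4, close to the values quoted; individual entries may differ from the table by one unit because of rounding.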

Example  II 

Sampling  with  unequal  probabilities  is  often  used  to  yield  a 
"self-weighting"  sample  in  cluster  sampling  or  subsampling  when  the 
clusters  are  of  unequal  size.  This  is  the  context  in  which  Hansen 
and  Hurwitz  first  introduced  the  technique. 

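When sampling n clusters from a total population of N clusters, where the cluster size, M, is the same for each cluster, the population mean is estimated by averaging the cluster means:

$$\bar{X}_{cl} = \frac{1}{n}\sum_{i=1}^{n}\bar{X}_i$$

However, if cluster size varies from cluster to cluster, a weighted estimator would be more precise:

$$\bar{X}_{w} = \frac{1}{n}\sum_{i=1}^{n}\frac{M_i}{\bar{M}}\bar{X}_i$$

where $\bar{M}$ is the average cluster size. This expression can be manipulated as follows:

$$\bar{X}_{w} = \frac{1}{N\bar{M}}\sum_{i=1}^{n}\frac{M_i\bar{X}_i}{n/N} = \frac{1}{N\bar{M}}\sum_{i=1}^{n}\frac{M_i\bar{X}_i}{p_i}$$

where $p_i = n/N$ is the probability of selecting the ith cluster. So far, all of the $p_i$'s have been the same. Suppose, however, that each $p_i$ is made proportional to its corresponding $M_i$. The probability of selecting each cluster is then $n(M_i/\bar{M}N)$. Substituting this into the above formula gives:

$$\frac{1}{N\bar{M}}\sum_{i=1}^{n}\frac{M_i\bar{X}_i}{nM_i/(\bar{M}N)} = \frac{1}{n}\sum_{i=1}^{n}\bar{X}_i = \bar{X}_{cl}$$

which is the same as the simple unweighted estimator. Thus, the sample is said to be self-weighting: $\bar{X}_{cl}$ is the appropriate estimator for the population mean even though cluster sizes vary.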
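The self-weighting property can be verified numerically. The sketch below is only an illustration: the cluster sizes, the unsampled cluster means, and the choice of sampled clusters are hypothetical, and the weighted form is assumed to be the sum of M_i times the cluster mean, each divided by its selection probability, all over N times the average cluster size.

```python
# Hypothetical population of N = 6 clusters (sizes and means are made up).
sizes = [30, 20, 25, 15, 40, 30]
means = [105.0, 98.0, 92.0, 101.0, 112.0, 95.0]
N = len(sizes)
m_bar = sum(sizes) / N     # average cluster size
n = 3
sample = [0, 2, 4]         # indices of the clusters assumed to have been drawn

def weighted_estimate(sample, p):
    """(1 / (N * M_bar)) * sum of M_i * X_bar_i / p_i over the sampled clusters."""
    return sum(sizes[i] * means[i] / p[i] for i in sample) / (N * m_bar)

# Selection probabilities proportional to cluster size: p_i = n * M_i / (M_bar * N).
p_size = [n * m / (m_bar * N) for m in sizes]

self_weighted = weighted_estimate(sample, p_size)
simple_mean = sum(means[i] for i in sample) / n

print(self_weighted, simple_mean)
```

With probabilities proportional to size, the cluster sizes cancel and the weighted estimate coincides with the plain average of the sampled cluster means.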

The technique for selecting the clusters with unequal probabilities is the same as outlined before, except that the basis for selection is now a cumulative list of cluster sizes, rather than the auxiliary variable. The worksheet for such a sample might have the following format:


The estimated population mean is then:

$$\bar{X}_{cl} = \frac{1}{3}(105 + 92 + 112) = 103$$


COMPARISON  OF  DESIGNS 

The task of compiling some sort of quantitative comparison of the foregoing sample designs is not realistic, since so much depends on the characteristics of the particular population under study (see Des Raj, Zarkovich). It may be helpful to briefly categorize the attributes of the various designs and estimation procedures as they relate to accuracy and cost:


Simple random sample      o  Simplest design.

Stratified sampling       o  Nearly always more precise than simple random sample.

Cluster sampling          o  Simpler frame and reduced travel costs.
                          o  Usually less precise than simple random sample.

Sub-sampling              o  Flexibility in balancing cost-precision trade-off,
                             especially when convenient cluster size is too small
                             for stratification and too large for cluster sampling.

Systematic sampling       o  Ease in selecting sample points.
                          o  May give better representation, depending on frame.
                          o  May be biased, depending on frame.

Ratio estimator           o  Usually more precision than simple estimator.
                          o  "Significant" bias in small samples.

Regression estimator      o  More precision than simple estimator.
                          o  "Significant" bias in small samples.

Unequal probabilities     o  Usually more precision than equal probabilities.
                          o  Difficult to assess sampling error.

Usually,  the  circumstances  of  the  analysis  readily  suggest  the 
most  appropriate  design  or  combination  of  designs.  For  example,  USAF- 
wide  sampling  immediately  leads  to  the  possibility  of  stratification 
by  major  command  or  some  geographical  classification;  also,  the 
presence  of  a  convenient  auxiliary  variable  makes  ratio  or  regression 
estimators  attractive. 

On the other hand, there are often factors to consider that do not readily fit into the framework of cost-precision tradeoffs. One such factor is the need to minimize the imposition of field work on USAF personnel who have other responsibilities (e.g., maintenance chiefs or accounting clerks); the essence of sample work is loyal adherence to good procedure, and it is often more fuss than the busy serviceman can handle. More often than not, the situation will be such that there is no completely objective approach to designing the sample.

In  any  case,  the  general  procedure  is  the  same  as  with  all  prob¬ 
lem  solving:  specify  the  objectives,  survey  factors  related  to  the 
problem,  identify  alternative  solutions,  quantify  the  problems  as 
much  as  possible  to  reduce  subjective  uncertainty,  and  make  such  in¬ 
tuitive  decisions  as  are  necessary. 


This Section gives recognition to some topics which often arise in the application of sampling techniques and which seem particularly relevant to the forecasting nature of cost analysis. Complex sample design is discussed and an example of a sampling study recently conducted by the Cost Analysis Division, Headquarters Strategic Air Command, is described. The Section concludes with a discussion of the application of sampled data in regression analysis.

COMPLEX  DESIGNS 

Theory surrounding the subject of complex sample designs is generally less developed than for the basic designs, and documentation of research is highly fragmented among various journal publications and a few books.

Two kinds of complexity are worth noting: (1) compounding of design and (2) compounding of purpose.


Sometimes the characteristics of the population are such that it is convenient to compound the various basic designs. Drawing on the example in the previous section where the salvage value of some training equipment was estimated, suppose a USAF-wide estimate was desired. The simplest scheme might be to select a simple random sample of 30 bases, then apply the regression estimator within each. A more precise estimate might be achieved by designing a "complex" sample along the following lines: (1) stratify bases on a two-way scheme using major command and geography as classifications, resulting in about 13 strata; (2) select two bases from each stratum with unequal probabilities, using number-of-airmen as the auxiliary variable; (3) sub-sample several items of equipment from each base; (4) estimate the total salvage value for each base with a regression estimator, using the salvage expert's "eyeball" estimates as the auxiliary variable. The total USAF estimate would be given by:

$$\sum_{h}\frac{1}{2}\left(\frac{\hat{X}_{h1}}{P_{h1}} + \frac{\hat{X}_{h2}}{P_{h2}}\right)$$

where $\hat{X}_{h1}$ and $\hat{X}_{h2}$ are the two estimated base totals from each stratum h, and $P_{h1}$ and $P_{h2}$ are their respective probabilities of selection. It would be extremely difficult to estimate what the sampling variance from such a design would be. The rationale for the procedure was generated by reasoning subjectively at each stage of the design that some particular technique would contribute most to precision in the final estimate. Although an objective estimate of the sampling error is not known, an upper limit may be set by computing the error that would result from a less complex design.

There are simplified methods for computing the sampling error once the sample has been drawn. One way is to use the technique called replication. Instead of drawing the entire sample in one operation, only a fraction of the sample points are drawn, and the procedure is repeated until the total sample is drawn. The variance of the sampling distribution is then computed from the several estimates. The example above, for instance, might be completed in three separate samples, the difference being that for each sample, only one-third as many units of training equipment are selected from each base. The overall sample size is still about the same. Sampling variance is estimated as follows:

$$s_{\bar{X}}^2 = \frac{1}{3(2)}\sum_{i=1}^{3}(X_i - \bar{X})^2$$

where $X_i$ (i = 1, 2, 3) is the estimate from the ith sample and $\bar{X}$ is the average of the three samples. The use of replication in sample design is treated extensively in Deming. Discussion of other methods is found in Zarkovich.
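A replicated estimate of sampling variance takes only a few lines to compute. The sketch below assumes the usual replicated-sampling form, in which the squared deviations of the k replicate estimates from their average are divided by k(k - 1); the three replicate figures are hypothetical and serve only to show the arithmetic.

```python
from statistics import mean, variance

def replicated_variance(estimates):
    """Variance of the average of k independent replicate estimates.

    statistics.variance divides the squared deviations by (k - 1);
    dividing again by k gives sum((x - x_bar)^2) / (k * (k - 1)).
    """
    k = len(estimates)
    return variance(estimates) / k

# Hypothetical replicate estimates of total salvage value (thousands of dollars).
replicates = [412.0, 397.0, 421.0]

overall = mean(replicates)                    # combined estimate
var_overall = replicated_variance(replicates) # estimated sampling variance
std_error = var_overall ** 0.5

print(overall, var_overall, std_error)
```

With only three replicates the variance estimate is itself quite unstable, so more replicates of smaller size may be preferable when the field procedure allows it.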

Multi-Purpose  Surveys 

Survey design has been described as a process of evaluating alternative methods in terms of relative cost and precision. This is reasonably straightforward when only one characteristic is under measurement. It often happens that the sampler takes advantage of the situation by observing several characteristics instead of one. For example, a survey of airmen may involve observations on age, training, motivation, and rank. Determination of optimum design is now considerably less objective. The proper sample size for one characteristic may provide no useful information on a second characteristic, and give superfluous precision on a third. One characteristic may be perfectly suited for a stratified design while a companion characteristic is more adaptable to something else. Objectivity requires the assessment of the relative utility of the different information sought. These problems are discussed in Kish and in Yates.

A related complexity is found in the so-called "analytic" surveys. The objective of an analytic survey generally is to make comparisons between sub-populations, where the sub-populations cannot be framed (i.e., sampling units can be identified by sub-population only after the sample is taken). Referring to the sample on page 42 (sub-sampling from Civil Engineering Squadron work-order ledgers), the purpose might well have been to compare the resources devoted to several program categories. Each squadron-day selected in the sample would consist of work-orders in one or more categories, leading to estimates of total activity devoted to each program. So, in effect, several samples are being conducted, one for each sub-population. The feature that makes this different from other procedures heretofore discussed is that the sample size from each sub-population is also a variable. Furthermore, the sample sizes are negatively correlated and cannot be treated as independent variables. Procedures for handling this situation are discussed in Yates and in Hartley (the latter reference is probably the more straightforward). The general problem of analytical statistics from complex samples is summarized by Kish (pages 562-587), including a brief description of seven approaches to computing or approximating standard errors.

Example:  SAC  Aircraft  Maintenance 

The example that follows is a rather detailed description of a research program recently undertaken in SAC. The project is of direct interest here because (1) it illustrates a rather complex sample design problem, and (2) it provides a case study of data collection at a level of aggregation useful to a cost analyst.

The Cost Division at SAC Headquarters used probability sampling to collect some maintenance man-hour and material cost data on the KC-135, the B-52 (G and H series), and the UH-1F. Sampling was necessitated by the desire to obtain data from the original source documents, a procedure involving considerable effort; probability selection was chosen because of a preference for unbiased estimates and because there was no apparent basis for assuming a more precise judgment sample. Since the selection procedures are similar for all three aircraft, emphasis will be on the KC-135 sample.

Motivation. The project had three primary objectives. The first and most important was to evaluate the general behavior of maintenance requirements as an aircraft ages. Current aircraft costing models often assume (at least implicitly) that maintenance costs slope downward during the initial months following deployment into the force, then level off after "shake-down" is accomplished. SAC cost analysts hypothesize that, instead of leveling off indefinitely, costs tend to rise again as the equipment gets older. There is considerable interest in resolving the question since (1) there is some uncertainty as to when the strategic aircraft in the force will be replaced, and (2) the proposed Resources Management System has suggested changes in military management that could triple SAC's responsibility in programming and budgeting for maintenance resources.

The second objective of the sample was to explore the relationship, if any, between maintenance man-hours and other maintenance costs. It is common practice to pro-rate base maintenance costs among the various aircraft systems on the basis of man-hours. Since some systems require relatively greater parts requirements than others, the validity of such practice is questionable.

The third objective was to investigate errors in recording and reporting maintenance material consumption. There is evidence that parts data are sometimes treated in cavalier fashion from crew level on up the line to final reporting. If sources of error can be identified and subsequently reduced, the maintenance data will have greater utility for financial planning.

The  Population.  The  population  to  be  sampled  consisted  of  the  doc¬ 
uments  on  which  maintenance  personnel  record  their  work  (AFTO  210,  211 
and  212).  This  includes  parts  and  labor  expended  by  the  base  mainte¬ 
nance  shops  (field  maintenance,  CA&E  maintenance,  etc.);  bench  stock 
items  are  omitted,  but  this  is  a  very  insignificant  portion  of  overall 
maintenance.  These  source  documents  are  easily  identified  by  aircraft 
tail-number. 

Design. Sample design was addressed primarily to the first objective of the study, and the other two were more or less regarded as by-products of the first. The general idea was to obtain estimates of maintenance labor and material costs for each of several age groups, then observe whether the estimates conform to the hypothesized curve (data on engines would be recorded separately since engines move around from aircraft to aircraft). Initial delivery dates of the KC-135s range from 1958 to 1965, providing eight yearly age groups.

From the standpoint of precision, the sample design should assure good representation over a number of variables besides age that affect maintenance. For example, some variation can probably be associated with base-to-base differences in climate and maintenance management. Differences in flying-hour programs (e.g., ready alert vs. regular status) are likely to be even more significant.

With regard to cost and selection control, the best design would cluster the maintenance documents by aircraft tail number since this is the manner in which they are filed at base level. Any other arrangement would involve the field workers in sample selection or require much additional time in constructing a sampling frame. By designating the sampling unit as all maintenance performed on a given aircraft within a given time interval, sample selection can be accomplished entirely at SAC Headquarters.

Since little was known about the magnitude of variance that could be expected, the design approach was to decide how much work could be expected from the base personnel who would be doing the field work, then select as many aircraft as possible. The Cost Analysis Division at SAC Headquarters has at its disposal the part-time services of one man at every SAC base, and it was desired that each man be equally involved in the study. After examining the work involved in searching for the documents, copying labor and parts data, and searching catalogues for parts costs, it was decided to sample one KC-135 on each base over a two-month period (June and July, 1967). Since the number of aircraft varies from base to base, this plan necessitated sampling with unequal probabilities. The design was further complicated by the fact that the proportions of aircraft in the different age groups also vary from base to base.

The final choice was a two-stage design, with the first stage following a procedure first introduced by Goodman and Kish,* and the second stage using simple random selection.

For  the  first  stage,  primaries  were  designated  as  comprising  those 
aircraft  on  a  given  base  that  belong  to  the  same  age  group;  thus  with 
31 bases and 8 age groups, there was a maximum of 248 (31 x 8 = 248)
clusters.  Each  cluster  was  assigned  a  probability  of  selection  that 
is  roughly  proportional  to  the  number  of  aircraft  therein  (exact  pro¬ 
portionality  was  precluded  since  the  total  aircraft  per  base  varied). 

The  next  step  was  to  construct  21  "acceptable"  samples  such  that  each 
sample  contained  one  primary  from  each  base  and  at  least  one  primary 
from  each  age  group;  the  samples  were  simultaneously  assigned  probabil¬ 
ities  such  that  if  one  adds  up  the  probabilities  of  all  samples  in 
which  any  particular  primary  appears,  the  sum  will  equal  the  probability 
originally  assigned  that  primary.  Finally,  one  of  the  samples  was  ran¬ 
domly  chosen  with  probability  as  assigned. 
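The consistency requirement on the sample probabilities can be illustrated with a deliberately tiny case. The sketch below is not the SAC design: it assumes a hypothetical population of two bases (A, B) and two age groups (1, 2), with made-up primary probabilities chosen so that an exact solution exists.

```python
import random

# Hypothetical probabilities assigned to each primary (base + age group).
primary_prob = {"A1": 0.6, "A2": 0.4, "B1": 0.4, "B2": 0.6}

# "Acceptable" samples: one primary per base, every age group represented,
# each with an assigned probability of being the sample actually used.
sample_prob = {("A1", "B2"): 0.6, ("A2", "B1"): 0.4}

def implied_primary_probs(sample_prob):
    """Add up the probabilities of all samples in which each primary appears."""
    totals = {}
    for sample, p in sample_prob.items():
        for primary in sample:
            totals[primary] = totals.get(primary, 0.0) + p
    return totals

implied = implied_primary_probs(sample_prob)

# The design is consistent if every primary recovers its assigned probability.
consistent = all(abs(implied.get(k, 0.0) - p) < 1e-9 for k, p in primary_prob.items())

# Draw one acceptable sample with probability as assigned (seed fixed for repeatability).
rng = random.Random(1967)
chosen = rng.choices(list(sample_prob), weights=list(sample_prob.values()), k=1)[0]

print(consistent, chosen)
```

In a realistic case with many bases and age groups, finding a set of acceptable samples whose probabilities reproduce every primary's assigned probability is the laborious part; the check itself stays this simple.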

In  the  second  stage,  one  aircraft  was  chosen  at  random  from  each 
selected  primary. 

The overall effect of both stages was to select a sample that is stratified according to base and "controlled" by age group, while giving all aircraft approximately equal selection probabilities (the probabilities varied from about .03 to .06). The price of having such a controlled sample in an unbalanced population is that there is no unbiased estimator of sampling variance. However, a weighted estimator for variance is available that leads to overestimation, which is less objectionable than underestimation. Since any estimate of variance is itself subject to sampling error, the bias may not be too important. In any case, the bias would only be associated with the sample's first stage, from which sampling error should be small compared to that from the second stage.

*Goodman and Kish, "Controlled Selection--A Technique in Probability Sampling," Journal of the American Statistical Association, Vol. 45, pp. 350-372. Also see Kish, Survey Sampling, 1965, pp. 488-496.

In  analyzing  the  data,  separate  estimates  were  made  for  each  age 
group,  and  the  resulting  group  means  were  subjected  to  regression  ana¬ 
lysis  using  age  in  years  as  the  independent  variable.  The  use  of  group 
means  instead  of  the  raw  data  was  necessary  in  order  to  (1)  give  each 
age  group  equal  weight  in  the  regression  (sample  aircraft  were  unevenly 
allocated  among  age  groups)  and  (2)  to  accommodate  the  stratification 
and  probability  aspects  of  the  sampling— that  is,  to  help  dampen  other 
sources  of  variability  and  reveal  any  age-related  behavior.  Since  the 
age-group  means  were  derived  from  samples  of  various  sizes,  the  usual 
assumption  of  equal  variance  along  the  regression  line  was  clearly 
violated;  the  implication  is  loss  in  efficiency  in  obtaining  the  least- 
squares  fit. 

Some initial results are shown below:

The labor hours data conformed to the general hypothesis that maintenance increases with aircraft age and, when fitted in least-squares fashion to a parabolic curve, survived an F-test at .95 confidence (the F-test was a useful bench-mark despite violation of some underlying assumptions); analysis of material costs did not fare well at this level of aggregation. Subsequent examination entailed a closer look at individual sample aircraft and a distribution of labor hours and material according to maintenance shops. The results were presented at the March 1968 OSD Cost Research Symposium.

The  good  performance  of  the  labor  hours  regression  is  curiously 
inconsistent  with  the  very  large  variability  of  the  raw  data  within 
age-groups.  At  least  part  of  this  contradiction  can  be  explained  by 
the  efficiency  of  the  sample  design;  the  design  should  have  provided 
broader  representation  than  would  be  expected  from,  say,  simple  random 
sampling,  which  is  the  usual  data  collection  technique  for  regression 
analysis.  This  illustrates  one  of  the  several  considerations  surround¬ 
ing  data  collection  that  are  discussed  in  the  next  section  on  estimat¬ 
ing  relationships. 

ESTIMATING  RELATIONSHIPS  AND  SAMPLE  DESIGN 

Very  little  attention  in  statistical  literature  is  addressed  explic¬ 
itly  to  the  use  of  sampled  data  in  regression  analysis,  a  technique  often 
used  to  derive  estimating  relationships  for  military  cost  analysis.  There 
exists,  in  estimating  relationship  studies,  the  implied  assumption  that  the 
data  base  constitutes  a  sample  of  some  larger  population  (unless  the  regres¬ 
sion  is  simply  intended  to  describe  a  particular  set  of  points),  and  the 
main  concern  is  whether  that  sample  is  representative.  Moreover,  it  is 
simple  random  sampling  that  is  implied;  the  more  complicated  designs  (strat¬ 
ification,  unequal  probabilities,  etc.)  are  ignored  because  they  are  not 
generally used to build data bases. The intent is now to suggest how these designs might be so used in connection with least squares simple linear regression. The presentation can be made clearer by establishing a conceptual scheme within which data collection can be described. Accordingly, the data collection process will be divided into three phases:

(1) Partitioning the population. The population is divided into sub-populations that may be treated either as strata or clusters.

(2) Data selection. This phase includes any methods of determining what data will fall into the sample. The data from any given sub-population will comprise a sub-sample. In clustering, each sub-sample would thus include the entire sub-population, whereas only a portion would be included with stratification.

(3) Data reduction. The data in each sub-sample are reduced to a single mean value, using some estimator (simple mean, ratio estimator, etc.); the data base now consists of one value per stratum, or one value per cluster.

*Jean Mullery, Aircraft Maintenance Cost Research, KC-135, Directorate of Budget, Headquarters Strategic Air Command.

Regression can be performed on the data base either after the second phase (eliminating reduction) or after the third phase. If regression is performed after the second phase, there are two alternatives available: (1) a single regression on all data, or (2) weighted averages of regression coefficients calculated separately from each sub-sample. The following flow chart characterizes the total process of data collection and subsequent data analysis:

[Flow chart: data collection (Phase 1, Phase 2, Phase 3) followed by data analysis]

There  appear  to  be  two  basic  motives  f~r  using  reduced  date  rather 
than  the  original  sample:  (1)  to  adjust  for  unequal  simple  slsae  In 
the  various  sub-populations,  and  (2)  to  utilise  the  special  estimators 
for  increased  accuracy. 


The  ratio  and  regression  estimators  will  be  referred  to  collec¬ 
tively  as  the  "special"  estimators  so  as  to  avoid  confusion  with  the 
use  of  regression  to  develop  forecasting  relationships . 
If sample sizes within sub-populations are unequal, and it is desired
that all sub-populations be of equal importance in determining
the regression line, a sort of "weighted" regression is produced by using
reduced data; each sub-population is represented by the same number
of data points, namely, one. However, if it is preferable to weight on
the basis of individual observations rather than sub-populations, the
data should not be reduced.
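
The difference between the two weightings can be illustrated with a minimal Python sketch. The sub-sample labels and values below are hypothetical, and the simple mean is assumed as the reduction estimator; the point is only that a sub-population with many observations pulls the non-reduced fit more strongly than it pulls the reduced fit.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    return y_bar - b * x_bar, b

# Hypothetical sub-samples of (independent, dependent) observations.
sub_samples = {
    "A": [(1.0, 2.0), (1.0, 2.2), (1.0, 1.8), (1.0, 2.0)],  # 4 observations
    "B": [(2.0, 4.5)],                                      # 1 observation
    "C": [(3.0, 5.0), (3.0, 5.4)],                          # 2 observations
}

# Non-reduced data: every observation is one data point.
raw = [pt for pts in sub_samples.values() for pt in pts]
raw_fit = fit_line([x for x, _ in raw], [y for _, y in raw])

# Reduced data: each sub-population contributes exactly one point (its simple means).
reduced = [
    (mean(x for x, _ in pts), mean(y for _, y in pts))
    for pts in sub_samples.values()
]
reduced_fit = fit_line([x for x, _ in reduced], [y for _, y in reduced])

print("non-reduced (intercept, slope):", raw_fit)
print("reduced     (intercept, slope):", reduced_fit)
```

In this illustration sub-population A supplies four of the seven raw points but only one of the three reduced points, so the two fitted lines differ.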

If the opportunity should present itself, it would seem prudent
to use one of the special estimators or the unequal probability estimator
to reduce the data within each sub-population. These estimators
would provide more accurate estimates of the true sub-population means,
and hence lead to more accurate regression estimates. However, this
accuracy is gained at the cost of using some auxiliary variable, and
it might be preferable to use this variable as a second independent
variable in the regression equation. The objective side of deciding
which way the extra variable should be used involves the usual
cost-precision trade-off (which use will provide greater precision for a
given cost?). On the subjective side, the decision might be governed
by whether the extra variable can appropriately be specified in the
estimating relationship; a variable might be closely correlated with
the independent variable but still be ruled out of the regression model
because there is no logical causal relationship, or because its future
behavior is as doubtful as that of the dependent variable. In either of
these cases, the extra variable could be suitably used as an auxiliary
variable in a special estimator or unequal probability estimator. When
using the special estimators, each sample point will contain three kinds
of observations: one each for the dependent variable, the independent
variable, and the auxiliary variable. In unequal probability sampling,
only the independent and dependent variables will be observed since
values for the auxiliary variable are known prior to sample selection.

The sampling techniques that have been discussed in this paper
can be categorized into the three phases as follows:*

Partitioning      Selection                             Reduction

Stratification    Simple random sampling                Simple mean
Clustering        Systematic random sampling            Special estimators (ratio and regression)
                  Sampling with unequal probabilities   Sampling with unequal probabilities

*Note that in using this manner of classification, the techniques
of stratification and clustering include only the act of population
partitioning; the functions of sample point selection and estimation
of the population mean fall into the second and third phases.


Sampling with unequal probabilities falls into two categories because
selection according to this procedure requires the subsequent use of
the unequal probability estimator. Sub-sampling was omitted from the
list since it is really a hybrid of the other designs.

Below  is  a  schematic  diagram  of  the  full  set  of  feasible  designs 
that  can  be  put  together  from  these  techniques. 
The selection and reduction techniques are numbered for simplicity.

(1)  Simple  random  sampling. 

(2)  Systematic  random  sampling. 

(3)  Unequal  probabilities. 

(4)  Simple  mean. 

(5)  Special  estimators. 

With no partitioning, there is no alternative to simple random sampling
and regression on the non-reduced data. Partitioning, on the other hand,
allows 18 different basic designs, i.e., there are 18 paths by which
the final regression analysis can be reached. Any other design would
essentially be an extension of those above. For example, a schematic
for collection procedures based on sub-sampling indicates that there
is simply a replication of the selection phase:


Partition (clusters) -> Selection (I) (clusters) -> Selection (II) (units within clusters) -> Reduction


The foregoing has provided a rather cursory treatment of the preparation
of estimating relationships from sampled data. If data are
gathered by simple random sampling, subsequent regression analysis is
straightforward. More complex schemes lead to difficulties in interpreting
regression results. For example, unequal probability sampling
will usually lead to underestimated prediction intervals. Stratification
will sometimes produce biased regression coefficients. These problems
fall into the general area of analytic surveys and are currently
being addressed in a peripheral way by such men as Hartley, Kish, and
Konijn.
VI. SURVEY PROCEDURE

The preceding pages have provided a summary of those aspects of
probability sampling that relate to the design of sample selection
procedures. Some peripheral topics, such as questionnaire design and
training of field workers, have been ignored as outside the scope of the
paper but can be found in such texts as Stephan and McCarthy; quota
sampling, a widely used non-probability method, is discussed also in
Stephan and McCarthy, and in Kish.

The following generalized chronology of a sample survey is intended
to "wrap things up." These steps amount to formalization of the typical
decisionmaking process; but conscious observance of them is essential
to the mechanics of a valid survey, thereby forcing a rational
approach to the analysis.

FORMULATE  THE  PROBLEM 

The first and most important step is to identify the objectives
in a rather formalized way so that any subsequent planning alternative
can be clearly evaluated with respect to its contribution to those
objectives. The analyst is not merely seeking information; he is seeking
information that will eventually become part of the basis for some
specific decision or class of decisions. It would be well to itemize the
objectives and, as far as possible, to model the eventual decision
process. Where the survey is part of a group effort, it is equally
important to clarify each person's role and to establish a consensus of
group objectives. Having defined the problem, the analyst should refer
back to it often to avoid becoming over-engrossed in the details of
planning and administration.

DEFINE THE POPULATION

The objectives of the investigation determine the population from
which information is desired--the target population. The target
population is often different from that actually sampled. Although careful
planning will tend to eliminate this difference, there are some situations
where the discrepancy simply cannot be physically or economically
resolved. The obvious example is the use of historical data in planning
for the future. Another example is the exclusion of portions of the
population that are too inconvenient to sample. Upper limits can
sometimes be computed for the bias introduced, but usually some judgment
must be exercised to evaluate the extent to which the sampled population
mirrors the target population. This judgment should be documented in
the form of a list of assumptions, disclaimers, and uses for which the
survey results are appropriate.

SPECIFY  PRECISION 

The specification of desired precision is an important first step
in the design of a survey, although this specification may be simply
to obtain the greatest precision for a given budget. In any case, it
would seem prudent to examine the survey objectives with respect to the
accuracy required in the estimates. If the estimates are to be the
bases for comparisons, it makes little sense for them to have greater
precision than the standards against which they are compared. Some
studies are so heavily burdened with non-statistical uncertainty (e.g.,
poorly documented data or requirements uncertainty) that high precision
may be superfluous.

In  complex  efforts,  such  as  large  models  that  require  partitioning 
into  several  sub-models,  it  would  be  well  for  the  analysts  concerned  to 
discuss  together  the  precision  of  the  various  components  with  respect 
to  (1)  the  ultimate  use  of  the  model,  (2)  the  interrelationship  of  es¬ 
timates  within  the  model,  (3)  the  maximum  attainable  precision  for  the 
various estimates, and (4) budget and time constraints. The logical
time for such discussion would be after the model has been designed and
preliminary investigation of the various subject areas has been accomplished.

CONSTRUCT  A  FRAME 

To construct a sampling frame is to divide the population into
sampling units (clusters, strata, and/or simple population units), such
that every element of the population belongs to one and only one unit.
The frame is an ordering scheme, or list, that facilitates consistent
and unbiased selection of sampling units from the population. If more
than one sample design is under consideration, the frame must be
flexible so as to suit any one of them (e.g., the population might be divided
into strata and the population units grouped into clusters).
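
The "one and only one unit" requirement lends itself to a mechanical check. The following Python sketch is illustrative only: the function name `check_frame`, the unit labels, and the element identifiers are all hypothetical, and the frame is assumed to be held as a mapping from sampling unit to the list of population elements assigned to it.

```python
from collections import Counter

def check_frame(population, frame):
    """Return (missing, duplicated) population elements for a candidate frame.

    population: iterable of element identifiers.
    frame: dict mapping each sampling unit to the elements it contains.
    """
    counts = Counter(e for elements in frame.values() for e in elements)
    missing = sorted(e for e in population if counts[e] == 0)     # in no unit
    duplicated = sorted(e for e, c in counts.items() if c > 1)    # in several units
    return missing, duplicated

# Hypothetical example: element "A2" is listed twice and "B2" not at all.
population = ["A1", "A2", "B1", "B2"]
frame = {"north": ["A1", "A2"], "south": ["B1", "A2"]}
print(check_frame(population, frame))
```

A frame is usable in the sense described above when both returned lists are empty.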

SELECT  A  SAMPLING  PLAN 

After a preliminary investigation, the analyst should be able to
identify characteristics of the population that can be used to design
a sampling procedure that is more efficient (less variance for a given
sample size) than a simple random sample. These characteristics should
suggest several alternatives; for example, an available auxiliary
variable may lend itself either to regression estimation or to sampling
with unequal probabilities. The alternatives can usually be narrowed
down to one or two by making a priori assumptions about the different
sampling variances and costs. It may be necessary to make the final
choice on the basis of a pre-test that would try out the various plans
on a small scale.

CONDUCT  FIELD  WORK 

There  is  little  to  be  said  here,  provided  the  planning  has  been 
carefully done. However, if the analyst is not doing his own field
work,  there  should  be  provisions  made  to  check  the  quality  of  the  data 
as  soon  as  it  starts  coming  in.  In  any  case,  the  analyst  should  do  his 
own  sample  point  selection;  the  field  worker  should  be  concerned  only 
with  collection.  There  should  also  be  a  procedure  drawn  up  to  handle 
non-response,  the  failure  of  some  selected  sample  point  to  be  available 
for  sampling. 


SUMMARY.  ANALYSIS.  AND  DOCUMENTATION 


The data should be examined for erroneous observations, and the
estimates calculated. The sampling error should also be calculated,
and the sampling procedure summarized.

As an aid to future surveys, it is useful to make a detailed summary
of the sampling procedure, including costs that were encountered,
peculiar sampling problems, and characteristics of the population, such
as within-strata variances. These might help later surveys by giving
more confidence to a priori assumptions and eliminating the need for
pre-tests.
Appendix

ESTIMATORS 

This  appendix  gathers  together  the  estimators  of  means  and  vari¬ 
ances  along  with  formulas  for  determining  sample  allocation.  The  vari¬ 
ous  designs  are  treated  in  the  same  order  as  in  Sections  III  and  IV. 

STRATIFIED  SAMPLING 

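The population mean is estimated by $\bar{X}_{st}$, the weighted average of
the stratum means:

$$\bar{X}_{st} = \frac{1}{N\bar{M}} \sum_{i=1}^{N} M_i \bar{X}_i$$

$N$ = number of strata
$M_i$ = sub-population size of $i$th stratum
$\bar{X}_i$ = sample mean from $i$th stratum
$\bar{M}$ = average sub-population size

The estimate for sampling variance of $\bar{X}_{st}$ is also a "weighted" average:

$$s^2_{\bar{X}_{st}} = \frac{1}{(N\bar{M})^2} \sum_{i=1}^{N} M_i^2 \, s^2_{\bar{X}_i}$$

$s^2_{\bar{X}_i}$ = sampling variance from $i$th stratum, computed from the
squared deviations $(X_{ij} - \bar{X}_i)^2$ of the sample values in that stratum
$m_i$ = sample size from $i$th stratum.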
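
A minimal Python sketch of these two estimators follows. The function name and input layout are hypothetical, and the within-stratum sampling variance is assumed to take the usual simple-random-sampling form with a finite population correction, $(1 - m_i/M_i)\,s_i^2/m_i$; that form is an assumption of the sketch rather than something stated in this appendix.

```python
from statistics import mean, variance

def stratified_estimate(strata):
    """strata: list of (M_i, sample_values) pairs, one per stratum.

    Returns (estimated population mean, estimated sampling variance).
    """
    total = sum(M for M, _ in strata)            # N * M-bar, the population size
    x_st = sum(M * mean(xs) for M, xs in strata) / total
    weighted = 0.0
    for M, xs in strata:
        m = len(xs)
        # Assumed form: variance of a simple-random-sample mean with fpc.
        s2_mean = (1 - m / M) * variance(xs) / m
        weighted += M ** 2 * s2_mean
    return x_st, weighted / total ** 2

# Hypothetical strata: (sub-population size, sampled values).
strata = [(100, [10.0, 12.0, 14.0]), (300, [20.0, 22.0, 24.0, 26.0])]
x_st, var_st = stratified_estimate(strata)
print(x_st, var_st)
```

Each stratum needs at least two sampled values for its variance to be computable.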


The total sample may be divided among strata according to the
proportional allocation scheme:


If estimates are available for the variances within each stratum
and the costs of sampling from each stratum ($c_i$), optimal allocation
may be used:
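
The two allocation schemes can be sketched in Python as follows. The formulas used are the standard textbook forms and are an assumption of this sketch: proportional allocation takes the stratum sample size proportional to $M_i$, and cost-optimal allocation takes it proportional to $M_i S_i / \sqrt{c_i}$, where $S_i$ is the estimated within-stratum standard deviation. The function names and the symbol for the total sample size are hypothetical.

```python
from math import sqrt

def proportional_allocation(total_sample, sizes):
    """Allocate in proportion to stratum size M_i (fractional; round as needed)."""
    population = sum(sizes)
    return [total_sample * M / population for M in sizes]

def optimal_allocation(total_sample, sizes, std_devs, costs):
    """Assumed textbook form: m_i proportional to M_i * S_i / sqrt(c_i)."""
    weights = [M * S / sqrt(c) for M, S, c in zip(sizes, std_devs, costs)]
    total_weight = sum(weights)
    return [total_sample * w / total_weight for w in weights]

# Hypothetical example: two strata, 40 sample points in total.
print(proportional_allocation(40, [100, 300]))
print(optimal_allocation(40, [100, 300], [6.0, 2.0], [1.0, 4.0]))
```

In the example, the smaller stratum receives the larger share under optimal allocation because it is both more variable and cheaper to sample.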
introduces bias of the type associated with ratio estimators. The
estimator for the population mean is:

$$\bar{X}_{cl} = \frac{\sum_{i=1}^{n} \sum_{j=1}^{M_i} X_{ij}}{\sum_{i=1}^{n} M_i}$$

Sampling variance is estimated by

$$s^2_{\bar{X}_{cl}} = \frac{1-f}{n \bar{M}^2 (n-1)} \left[ \sum_{i=1}^{n} X_i^2 + \bar{X}_{cl}^2 \sum_{i=1}^{n} M_i^2 - 2 \bar{X}_{cl} \sum_{i=1}^{n} M_i X_i \right]$$

$f = \frac{n}{N}$

$X_i = \sum_j X_{ij}$ = total of $i$th cluster

$\bar{M}$ = average cluster size


Since $\bar{X}_{cl}$ is based on the ratio of two variables ($X_i$ and $M_i$), these
estimators have a bias that increases as the variability of the $M_i$
increases. As a rule of thumb, the bias may be overlooked when the
coefficient of variation for the $M_i$ is less than .2.

Otherwise, the sample size should be large (see Kish, page 276).
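
A short Python sketch of the cluster estimator for unequal cluster sizes is given below. It assumes the estimator is the ratio of the sampled cluster totals to the sampled cluster sizes, and that the variance estimate has the bracketed form $\sum X_i^2 + \bar{X}_{cl}^2 \sum M_i^2 - 2\bar{X}_{cl} \sum M_i X_i$, which equals $\sum (X_i - \bar{X}_{cl} M_i)^2$. The function name is hypothetical, and the sample average cluster size is used for $\bar{M}$ as a further assumption.

```python
def cluster_ratio_estimate(clusters, N):
    """clusters: list of (M_i, X_i) = (cluster size, cluster total) for the
    n sampled clusters; N: number of clusters in the population.

    Returns (estimated mean per element, estimated sampling variance).
    """
    n = len(clusters)
    sum_M = sum(M for M, _ in clusters)
    x_cl = sum(X for _, X in clusters) / sum_M
    m_bar = sum_M / n                      # assumption: sample average stands in for M-bar
    # Sum of squared residuals; algebraically equal to the bracketed expression.
    ss = sum((X - x_cl * M) ** 2 for M, X in clusters)
    var = (1 - n / N) * ss / (n * m_bar ** 2 * (n - 1))
    return x_cl, var

# Hypothetical sample of 3 clusters out of 30.
print(cluster_ratio_estimate([(10, 50.0), (20, 110.0), (30, 140.0)], N=30))
```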


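SUB-SAMPLING

The population mean is estimated as the average of primary means:

$$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} \bar{X}_i$$

The sampling variance has two components, one representing variation
between primaries and one representing variation among secondaries
within primaries. The estimator is:

$$s^2_{\bar{X}} = \frac{1-f_1}{n} S_B^2 + \frac{f_1 (1-f_2)}{nm} S_w^2$$

$$S_B^2 = \frac{1}{n-1} \sum_{i=1}^{n} (\bar{X}_i - \bar{X})^2$$

$$S_w^2 = \frac{1}{n(m-1)} \sum_{i=1}^{n} \sum_{j=1}^{m} (X_{ij} - \bar{X}_i)^2$$

$f_1 = \frac{n}{N}$, $f_2 = \frac{m}{M}$

$X_{ij}$ is the $j$th sample unit in the $i$th cluster.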


The optimal number of secondaries to select from each primary is
given by:

$$m = \sqrt{\frac{c_1}{c_2} \cdot \frac{S_w^2}{S_B^2 - S_w^2 / M}}$$

$c_1$ = cost of sampling one primary ("fixed" cost)

$c_2$ = cost of sampling each secondary ("variable" cost)

If the value computed is equal to 1 or less, then $m = 1$; if the value
is greater than $M$, one-stage (cluster) sampling should be used.

The determination of $n$, the number of primaries selected, depends
on whether total cost or precision is to be held constant. In the latter
case, $n$ is found by solving the sampling variance formula. If total
cost, $C$, is fixed, the following formula is solved for $n$:

$$C = n c_1 + n m c_2$$
The use of the foregoing methods for determining $n$ and $m$ requires
preliminary estimates of $c_i$, $S_B^2$, and $S_w^2$. For this purpose, these
estimates do not require great precision because the sampling variance
is not highly sensitive to the choice of $m$. It is usually easier to
estimate ratios $c_1/c_2$ and $S_w/S_B$, in which case tables are available to
aid the evaluation of $m$ (see Cochran, page 282).
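
The choice of $m$ and $n$ under a fixed budget can be sketched in Python. The sketch assumes the textbook form $m = \sqrt{(c_1/c_2)\, S_w^2 / (S_B^2 - S_w^2/M)}$ and the cost function $C = n c_1 + n m c_2$; the function names, the rounding of $m$ to the nearest integer, and the handling of a non-positive denominator are illustrative choices, not prescriptions from the text.

```python
from math import sqrt

def optimal_secondaries(c1, c2, s_b2, s_w2, M):
    """Number of secondaries per primary.

    c1, c2: cost per primary and per secondary.
    s_b2, s_w2: preliminary between- and within-primary variance estimates.
    M: number of secondaries in each primary.
    """
    denom = s_b2 - s_w2 / M
    if denom <= 0:
        return M                  # assumption: treat as one-stage (cluster) sampling
    value = sqrt((c1 / c2) * s_w2 / denom)
    if value <= 1:
        return 1
    if value > M:
        return M                  # one-stage (cluster) sampling
    return round(value)           # rounding rule is an illustrative choice

def primaries_for_budget(C, c1, c2, m):
    """Largest whole n satisfying n*c1 + n*m*c2 <= C."""
    return int(C // (c1 + m * c2))

# Hypothetical inputs: primaries cost 100, secondaries cost 4, budget 2000.
m = optimal_secondaries(100, 4, 9.0, 36.0, 50)
n = primaries_for_budget(2000, 100, 4, m)
print(m, n)
```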


Unequal  Cluster  Sizes 


As in simple cluster sampling, sub-sampling becomes more difficult
if cluster size, $M_i$, is variable. An estimator of the population
mean is:

$$\bar{X} = \frac{\sum_{i=1}^{n} M_i \bar{X}_i}{\sum_{i=1}^{n} M_i}$$


An  estimator  of  the  sampling  variance  is 


$f_1 = \frac{n}{N}$, $f_{2i} = \frac{m_i}{M_i}$

$\bar{M}$ = average cluster size.


The bias in these estimators again relates to the variability of $M_i$,
and can be made negligible by making $n$ large (see Cochran, page 300).


SYSTEMATIC  SAMPLING 

The estimator of the population mean, $\bar{X}$, in systematic sampling is
the same as for simple random sampling:

$$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$$
There is no single reliable method of estimating the sampling variance
because so much depends on the way the population is listed. Variance
formulas for specific kinds of populations can be found in sampling
texts.
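
For illustration, one common way of drawing a systematic sample is sketched below in Python: a random start followed by every $k$th unit of the listed frame. The function name is hypothetical, and the interval is assumed to be $k = \lfloor N/n \rfloor$; other conventions exist. Because the result depends on how the frame is ordered, the sketch says nothing about the sampling variance.

```python
import random

def systematic_sample(frame, n, rng=random):
    """Select n units from an ordered frame: random start, then every k-th unit."""
    k = len(frame) // n           # sampling interval (assumed convention)
    start = rng.randrange(k)      # random start in 0 .. k-1
    return frame[start::k][:n]

# Hypothetical frame of 20 listed units; a seeded generator makes the draw repeatable.
frame = list(range(20))
sample = systematic_sample(frame, 5, random.Random(0))
print(sample, sum(sample) / len(sample))
```

The sample mean of the selected units is then computed exactly as for a simple random sample.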

RATIO  ESTIMATOR 

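The ratio estimate of the population mean of the X variable is given by:

$$\bar{X}_R = \frac{\sum_{i=1}^{n} X_i}{\sum_{i=1}^{n} W_i} \, \mu_w$$

where $W$ is the auxiliary variable and $\mu_w$ is its known population mean.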


For  large  sample  sizes,  the  approximate  sampling  variance  is  estimated 


REGRESSION  ESTIMATOR 

The regression estimate of the population mean of the X variable
is given by:

$$\bar{X}_r = \bar{X} + b(\mu_w - \bar{W})$$

The least-squares estimator for b is:

$$b = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(W_i - \bar{W})}{\sum_{i=1}^{n} (W_i - \bar{W})^2}$$

The sampling variance for large samples is estimated by:


4  - 
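
The point estimates from the two special estimators can be sketched in Python as follows. The sketch assumes the standard forms: the ratio estimate is the sample ratio $\sum X_i / \sum W_i$ multiplied by the known population mean of the auxiliary variable, and the regression estimate adjusts the sample mean by $b$ times the difference between the population and sample means of the auxiliary variable. The function names and the argument name `mu_w` are hypothetical; variance estimates are not attempted here.

```python
from statistics import mean

def ratio_estimate(xs, ws, mu_w):
    """Ratio estimate of the population mean of X, given auxiliary values ws
    and the known population mean mu_w of the auxiliary variable."""
    return mu_w * sum(xs) / sum(ws)

def regression_estimate(xs, ws, mu_w):
    """Regression estimate of the population mean of X using least-squares b."""
    x_bar, w_bar = mean(xs), mean(ws)
    b = (sum((x - x_bar) * (w - w_bar) for x, w in zip(xs, ws))
         / sum((w - w_bar) ** 2 for w in ws))
    return x_bar + b * (mu_w - w_bar)

# Hypothetical sample in which X is exactly twice W.
xs, ws = [2.0, 4.0, 6.0], [1.0, 2.0, 3.0]
print(ratio_estimate(xs, ws, mu_w=2.5), regression_estimate(xs, ws, mu_w=2.5))
```

When X is exactly proportional to W, as in this example, the two estimates coincide.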
UNEQUAL PROBABILITY SAMPLING

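The estimate of the population mean is provided by:

$$\bar{X}_P = \frac{1}{nN} \sum_{i=1}^{n} \frac{X_i}{p_i}$$

where $p_i$ is the probability of selecting $X_i$. The variance for
replacement sampling is estimated by:

$$s^2_{\bar{X}_{PR}} = \frac{1}{N^2 n(n-1)} \sum_{i=1}^{n} \left( \frac{X_i}{p_i} - \frac{1}{n} \sum_{k=1}^{n} \frac{X_k}{p_k} \right)^2$$

The variance for non-replacement sampling can be estimated by:

$$s^2_{\bar{X}_{NPR}} = \frac{1}{N^2} \sum_{i>j} \frac{P_i P_j - P_{ij}}{P_{ij}} \left( \frac{X_i}{P_i} - \frac{X_j}{P_j} \right)^2$$

where $P_i$ and $P_j$ are the respective probabilities with which $X_i$ and $X_j$
were included in the sample, and $P_{ij}$ is the joint probability with
which both $X_i$ and $X_j$ were included. This estimator is unbiased if $P_{ij}$
is non-zero for all $i$ and $j$.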
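
For sampling with replacement, the estimate of the mean and its variance can be sketched in Python. The sketch assumes that each draw yields $X_i / (N p_i)$ as a separate estimate of the population mean, so that the overall estimate is their average and the variance estimate is their sample variance divided by $n$. The function name and input layout are hypothetical.

```python
def unequal_probability_estimate(sample, N):
    """sample: list of (X_i, p_i) pairs drawn with replacement, where p_i is the
    selection probability of the unit; N: number of units in the population.

    Returns (estimated population mean, estimated sampling variance).
    """
    n = len(sample)
    z = [x / (N * p) for x, p in sample]   # one estimate of the mean per draw
    x_p = sum(z) / n
    var = sum((zi - x_p) ** 2 for zi in z) / (n * (n - 1))
    return x_p, var

# Hypothetical sample of 3 draws from a population of 4 units.
print(unequal_probability_estimate([(10.0, 0.1), (30.0, 0.3), (44.0, 0.4)], N=4))
```

The closer the selection probabilities are to being proportional to the $X_i$, the more nearly equal the per-draw estimates become and the smaller the estimated variance.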

Non-replacement sampling offers the same kind of efficiency advantages
for unequal probability sampling as in sampling with equal
probabilities, and is therefore widely used. However, while the benefits
with equal probabilities are reflected in the finite population
correction factor (page 23), the corresponding theory for unequal
probabilities finds no such simple expression. A "unified," or
all-inclusive, theory has not been forthcoming; current literature is
generally focused on various aspects of three related problems: (1) the
control of the $P_i$ through selection procedures, (2) the control of the
$P_{ij}$ through selection procedures, and (3) the conditions under which
estimates behave according to the central limit theorem (i.e., tend to
be normally distributed).